Re: [PATCH v5 bpf-next 6/9] bpf: Make redirect_info accessible from modules

2018-07-29 Thread kbuild test robot
Hi Toshiaki,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:
https://github.com/0day-ci/linux/commits/Toshiaki-Makita/net-Export-skb_headers_offset_update/20180729-094722
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/core/filter.c:116:48: sparse: expression using sizeof(void)
   net/core/filter.c:116:48: sparse: expression using sizeof(void)
   net/core/filter.c:210:32: sparse: cast to restricted __be16
   net/core/filter.c:210:32: sparse: cast to restricted __be16
   net/core/filter.c:210:32: sparse: cast to restricted __be16
   net/core/filter.c:210:32: sparse: cast to restricted __be16
   net/core/filter.c:210:32: sparse: cast to restricted __be16
   net/core/filter.c:210:32: sparse: cast to restricted __be16
   net/core/filter.c:210:32: sparse: cast to restricted __be16
   net/core/filter.c:210:32: sparse: cast to restricted __be16
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:237:32: sparse: cast to restricted __be32
   net/core/filter.c:410:33: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:413:33: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:416:33: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:419:33: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:422:33: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:495:27: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:498:27: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:501:27: sparse: subtraction of functions? Share your drugs
   include/linux/slab.h:631:13: sparse: undefined identifier '__builtin_mul_overflow'
   include/linux/slab.h:631:13: sparse: not a function
   include/linux/filter.h:640:16: sparse: expression using sizeof(void)
   include/linux/filter.h:640:16: sparse: expression using sizeof(void)
   include/linux/filter.h:640:16: sparse: expression using sizeof(void)
   include/linux/filter.h:640:16: sparse: expression using sizeof(void)
   net/core/filter.c:1389:39: sparse: incorrect type in argument 1 (different address spaces) @@expected struct sock_filter const *filter @@got struct sockstruct sock_filter const *filter @@
   net/core/filter.c:1389:39:expected struct sock_filter const *filter
   net/core/filter.c:1389:39:got struct sock_filter [noderef] *filter
   include/linux/filter.h:640:16: sparse: expression using sizeof(void)
   include/linux/filter.h:640:16: sparse: expression using sizeof(void)
   net/core/filter.c:1491:39: sparse: incorrect type in argument 1 (different address spaces) @@expected struct sock_filter const *filter @@got struct sockstruct sock_filter const *filter @@
   net/core/filter.c:1491:39:expected struct sock_filter const *filter
   net/core/filter.c:1491:39:got struct sock_filter [noderef] *filter
   include/linux/filter.h:640:16: sparse: expression using sizeof(void)
   net/core/filter.c:1824:43: sparse: incorrect type in argument 2 (different base types) @@expected restricted __wsum [usertype] diff @@got unsigned lonrestricted __wsum [usertype] diff @@
   net/core/filter.c:1824:43:expected restricted __wsum [usertype] diff
   net/core/filter.c:1824:43:got unsigned long long [unsigned] [usertype] to
   net/core/filter.c:1827:36: sparse: incorrect type in argument 2 (different base types) @@expected restricted __be16 [usertype] old @@got unsigned lonrestricted __be16 [usertype] old @@
   net/core/filter.c:1827:36:expected restricted __be16 [usertype] old
   net/core/filter.c:1827:36:got unsigned long long [unsigned] [usertype] from
   net/core/filter.c:1827:42: sparse: incorrect type in argument 3 (different base types) @@expected restricted __be16 [usertype] new @@got unsigned lonrestricted __be16 [usertype] new @@
   net/core/filter.c:1827:42:expected restricted __be16 [usertype] new
   net/core/filter.c:1827:42:got unsigned long long [unsigned] [usertype] to
   net/core/filter.c:1830:36: sparse: incorrect ty

Re: [patch net-next] net: sched: don't dump chains only held by actions

2018-07-29 Thread Jiri Pirko
Sat, Jul 28, 2018 at 07:39:36PM CEST, xiyou.wangc...@gmail.com wrote:
>On Sat, Jul 28, 2018 at 10:20 AM Cong Wang  wrote:
>>
>> On Fri, Jul 27, 2018 at 12:47 AM Jiri Pirko  wrote:
>> >
>> > From: Jiri Pirko 
>> >
>> > In case a chain is empty and not explicitly created by a user,
>> > such a chain should not exist. The only exception is if there is
>> > an action "goto chain" pointing to it. In that case, don't show the
>> > chain in the dump. Track the chain references held by actions and
>> > use them to find out if a chain should or should not be shown
>> > in chain dump.
>> >
>> > Signed-off-by: Jiri Pirko 
>>
>> Looks reasonable to me.
>>
>> Acked-by: Cong Wang 
>
>Hold on...
>
>If you increase the refcnt for a zombie chain on NEWCHAIN path,
>then it would become a non-zombie, this makes sense. However,
>if the action_refcnt gets increased again when another action uses it,
>it becomes a zombie again because refcnt==action_refcnt??

No. An action always increases both refcnt and action_refcnt.


[PATCH rdma-next 08/10] IB/ipoib: Do not remove child devices from within the ndo_uninit

2018-07-29 Thread Leon Romanovsky
From: Jason Gunthorpe 

Switching to priv_destructor and needs_free_netdev created a subtle
ordering problem in ipoib_remove_one.

Now that unregister_netdev frees the netdev and priv, we must ensure that
the children are unregistered before trying to unregister the parent,
otherwise the child unregister will use memory after it has been freed.

The solution is to unregister the children, then the parent, in the same
batch, all while holding the rtnl_lock. This closes all the races where a
new child could have been added and ensures proper ordering.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib.h  |  7 +++
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 28 +---
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c |  6 ++
 3 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 804cb4bee57d..1abe3c62f106 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -330,6 +330,13 @@ struct ipoib_dev_priv {
 
unsigned long flags;
 
+   /*
+* This protects access to the child_intfs list.
+* To READ from child_intfs the RTNL or vlan_rwsem read side must be
+* held.  To WRITE RTNL and the vlan_rwsem write side must be held (in
+* that order) This lock exists because we have a few contexts where
+* we need the child_intfs, but do not want to grab the RTNL.
+*/
struct rw_semaphore vlan_rwsem;
struct mutex mcast_mutex;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index e9f4f261fe20..b2fe23d60103 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1939,18 +1939,15 @@ static int ipoib_ndo_init(struct net_device *ndev)
 
 static void ipoib_ndo_uninit(struct net_device *dev)
 {
-   struct ipoib_dev_priv *priv = ipoib_priv(dev), *cpriv, *tcpriv;
-   LIST_HEAD(head);
+   struct ipoib_dev_priv *priv = ipoib_priv(dev);
 
ASSERT_RTNL();
 
-   /* Delete any child interfaces first */
-   list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) {
-   /* Stop GC on child */
-   cancel_delayed_work_sync(&cpriv->neigh_reap_task);
-   unregister_netdevice_queue(cpriv->dev, &head);
-   }
-   unregister_netdevice_many(&head);
+   /*
+* ipoib_remove_one guarantees the children are removed before the
+* parent, and that is the only place where a parent can be removed.
+*/
+   WARN_ON(!list_empty(&priv->child_intfs));
 
ipoib_neigh_hash_uninit(dev);
 
@@ -2466,16 +2463,25 @@ static void ipoib_add_one(struct ib_device *device)
 
 static void ipoib_remove_one(struct ib_device *device, void *client_data)
 {
-   struct ipoib_dev_priv *priv, *tmp;
+   struct ipoib_dev_priv *priv, *tmp, *cpriv, *tcpriv;
struct list_head *dev_list = client_data;
 
if (!dev_list)
return;
 
list_for_each_entry_safe(priv, tmp, dev_list, list) {
+   LIST_HEAD(head);
ipoib_parent_unregister_pre(priv->dev);
 
-   unregister_netdev(priv->dev);
+   rtnl_lock();
+
+   list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs,
+list)
+   unregister_netdevice_queue(cpriv->dev, &head);
+   unregister_netdevice_queue(priv->dev, &head);
+   unregister_netdevice_many(&head);
+
+   rtnl_unlock();
}
 
kfree(dev_list);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 
b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index 891c5b40018a..fa4dfcee2644 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -67,6 +67,12 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct 
ipoib_dev_priv *priv,
 
ASSERT_RTNL();
 
+   /*
+* Racing with unregister of the parent must be prevented by the
+* caller.
+*/
+   WARN_ON(ppriv->dev->reg_state != NETREG_REGISTERED);
+
priv->parent = ppriv->dev;
priv->pkey = pkey;
priv->child_type = type;
-- 
2.14.4



[PATCH rdma-next 02/10] IB/ipoib: Use cancel_delayed_work_sync for neigh-clean task

2018-07-29 Thread Leon Romanovsky
From: Erez Shitrit 

The neigh_reap_task is self-restarting, but as long as we call
cancel_delayed_work_sync() it is guaranteed not to be running and will
never start again. Thus we don't need the racy IPOIB_STOP_NEIGH_GC bit,
or the confusing mismatch of places that sometimes call flush_workqueue
after the cancel.

This fixes rare cases where the GC work could have been left running.

Signed-off-by: Erez Shitrit 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 7ca9013bf05c..7cd42619deb2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1661,7 +1661,7 @@ static void ipoib_neigh_hash_uninit(struct net_device 
*dev)
/* Stop GC if called at init fail need to cancel work */
stopped = test_and_set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
if (!stopped)
-   cancel_delayed_work(&priv->neigh_reap_task);
+   cancel_delayed_work_sync(&priv->neigh_reap_task);
 
ipoib_flush_neighs(priv);
 
@@ -1837,7 +1837,7 @@ void ipoib_dev_cleanup(struct net_device *dev)
list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) {
/* Stop GC on child */
set_bit(IPOIB_STOP_NEIGH_GC, &cpriv->flags);
-   cancel_delayed_work(&cpriv->neigh_reap_task);
+   cancel_delayed_work_sync(&cpriv->neigh_reap_task);
unregister_netdevice_queue(cpriv->dev, &head);
}
unregister_netdevice_many(&head);
@@ -2346,7 +2346,7 @@ static struct net_device *ipoib_add_port(const char 
*format,
flush_workqueue(ipoib_workqueue);
/* Stop GC if started before flush */
set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
-   cancel_delayed_work(&priv->neigh_reap_task);
+   cancel_delayed_work_sync(&priv->neigh_reap_task);
flush_workqueue(priv->wq);
ipoib_dev_cleanup(priv->dev);
 
@@ -2412,7 +2412,7 @@ static void ipoib_remove_one(struct ib_device *device, 
void *client_data)
 
/* Stop GC */
set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
-   cancel_delayed_work(&priv->neigh_reap_task);
+   cancel_delayed_work_sync(&priv->neigh_reap_task);
flush_workqueue(priv->wq);
 
/* Wrap rtnl_lock/unlock with mutex to protect sysfs calls */
-- 
2.14.4



[PATCH rdma-next 10/10] IB/ipoib: Consolidate checking of the proposed child interface

2018-07-29 Thread Leon Romanovsky
From: Jason Gunthorpe 

Move all the checking of the pkey and other validity into the
__ipoib_vlan_add function. This removes the last difference in the
control flow leading to __ipoib_vlan_add, making the overall design
simpler to understand.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Erez Shitrit 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib_netlink.c |  3 --
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c| 77 +++-
 2 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_netlink.c 
b/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
index 7e093b7aad8f..d4d553a51fa9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
@@ -122,9 +122,6 @@ static int ipoib_new_child_link(struct net *src_net, struct 
net_device *dev,
} else
child_pkey  = nla_get_u16(data[IFLA_IPOIB_PKEY]);
 
-   if (child_pkey == 0 || child_pkey == 0x8000)
-   return -EINVAL;
-
err = __ipoib_vlan_add(ppriv, ipoib_priv(dev),
   child_pkey, IPOIB_RTNL_CHILD);
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 
b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index ca3a7f6c0998..341753fbda54 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -50,6 +50,39 @@ static ssize_t show_parent(struct device *d, struct 
device_attribute *attr,
 }
 static DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL);
 
+static bool is_child_unique(struct ipoib_dev_priv *ppriv,
+   struct ipoib_dev_priv *priv)
+{
+   struct ipoib_dev_priv *tpriv;
+
+   ASSERT_RTNL();
+
+   /*
+* Since the legacy sysfs interface uses pkey for deletion it cannot
+* support more than one interface with the same pkey, it creates
+* ambiguity.  The RTNL interface deletes using the netdev so it does
+* not have a problem to support duplicated pkeys.
+*/
+   if (priv->child_type != IPOIB_LEGACY_CHILD)
+   return true;
+
+   /*
+* First ensure this isn't a duplicate. We check the parent device and
+* then all of the legacy child interfaces to make sure the Pkey
+* doesn't match.
+*/
+   if (ppriv->pkey == priv->pkey)
+   return false;
+
+   list_for_each_entry(tpriv, &ppriv->child_intfs, list) {
+   if (tpriv->pkey == priv->pkey &&
+   tpriv->child_type == IPOIB_LEGACY_CHILD)
+   return false;
+   }
+
+   return true;
+}
+
 /*
  * NOTE: If this function fails then the priv->dev will remain valid, however
  * priv can have been freed and must not be touched by caller in the error
@@ -73,10 +106,20 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct 
ipoib_dev_priv *priv,
 */
WARN_ON(ppriv->dev->reg_state != NETREG_REGISTERED);
 
+   if (pkey == 0 || pkey == 0x8000) {
+   result = -EINVAL;
+   goto out_early;
+   }
+
priv->parent = ppriv->dev;
priv->pkey = pkey;
priv->child_type = type;
 
+   if (!is_child_unique(ppriv, priv)) {
+   result = -ENOTUNIQ;
+   goto out_early;
+   }
+
/* We do not need to touch priv if register_netdevice fails */
ndev->priv_destructor = ipoib_intf_free;
 
@@ -88,9 +131,7 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct 
ipoib_dev_priv *priv,
 * register_netdevice sometimes calls priv_destructor,
 * sometimes not. Make sure it was done.
 */
-   if (ndev->priv_destructor)
-   ndev->priv_destructor(ndev);
-   return result;
+   goto out_early;
}
 
/* RTNL childs don't need proprietary sysfs entries */
@@ -111,6 +152,11 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct 
ipoib_dev_priv *priv,
 sysfs_failed:
unregister_netdevice(priv->dev);
return -ENOMEM;
+
+out_early:
+   if (ndev->priv_destructor)
+   ndev->priv_destructor(ndev);
+   return result;
 }
 
 int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
@@ -118,17 +164,11 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned 
short pkey)
struct ipoib_dev_priv *ppriv, *priv;
char intf_name[IFNAMSIZ];
struct net_device *ndev;
-   struct ipoib_dev_priv *tpriv;
int result;
 
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-   ppriv = ipoib_priv(pdev);
-
-   snprintf(intf_name, sizeof(intf_name), "%s.%04x",
-ppriv->dev->name, pkey);
-
if (!rtnl_trylock())
return restart_syscall();
 
@@ -137,23 +177,10 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned 
short pkey)
return -EPERM;
}
 
-   /*
-* F

[PATCH rdma-next 06/10] RDMA/netdev: Use priv_destructor for netdev cleanup

2018-07-29 Thread Leon Romanovsky
From: Jason Gunthorpe 

Now that the unregister_netdev flow for IPoIB no longer relies on
external code, we can introduce the use of priv_destructor and
needs_free_netdev.

The rdma_netdev flow is switched to use the netdev common priv_destructor
instead of the special free_rdma_netdev and the IPOIB ULP adjusted:
 - priv_destructor needs to switch to point to the ULP's destructor
   which will then call the rdma_ndev's in the right order
 - We need to be careful around the error unwind of register_netdev
   as it sometimes calls priv_destructor on failure
 - ULPs need to use ndo_init/uninit to ensure proper ordering
   of failures around register_netdev

Switching to priv_destructor is a necessary prerequisite to using
the rtnl new_link mechanism.

The VNIC user for rdma_netdev should also be revised, but that is left for
another patch.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Denis Drozdov 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/main.c  |  10 --
 drivers/infiniband/ulp/ipoib/ipoib.h   |   2 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c  | 101 +
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c  |  68 --
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  37 
 include/linux/mlx5/driver.h|   3 -
 include/rdma/ib_verbs.h|   6 +-
 7 files changed, 129 insertions(+), 98 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index a93ef6367d38..9011187ca081 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -5157,11 +5157,6 @@ static int mlx5_ib_get_hw_stats(struct ib_device *ibdev,
return num_counters;
 }

-static void mlx5_ib_free_rdma_netdev(struct net_device *netdev)
-{
-   return mlx5_rdma_netdev_free(netdev);
-}
-
 static struct net_device*
 mlx5_ib_alloc_rdma_netdev(struct ib_device *hca,
  u8 port_num,
@@ -5171,17 +5166,12 @@ mlx5_ib_alloc_rdma_netdev(struct ib_device *hca,
  void (*setup)(struct net_device *))
 {
struct net_device *netdev;
-   struct rdma_netdev *rn;

if (type != RDMA_NETDEV_IPOIB)
return ERR_PTR(-EOPNOTSUPP);

netdev = mlx5_rdma_netdev_alloc(to_mdev(hca)->mdev, hca,
name, setup);
-   if (likely(!IS_ERR_OR_NULL(netdev))) {
-   rn = netdev_priv(netdev);
-   rn->free_rdma_netdev = mlx5_ib_free_rdma_netdev;
-   }
return netdev;
 }

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 02ad1a60dc80..d2cb0a8500e3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -323,6 +323,7 @@ struct ipoib_dev_priv {
spinlock_t lock;

struct net_device *dev;
+   void (*next_priv_destructor)(struct net_device *dev);

struct napi_struct send_napi;
struct napi_struct recv_napi;
@@ -481,6 +482,7 @@ static inline void ipoib_put_ah(struct ipoib_ah *ah)
kref_put(&ah->ref, ipoib_free_ah);
 }
 int ipoib_open(struct net_device *dev);
+void ipoib_intf_free(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 int ipoib_add_umcast_attr(struct net_device *dev);

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 67ab52eec3e9..73d917d57f93 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -2062,6 +2062,13 @@ void ipoib_setup_common(struct net_device *dev)
netif_keep_dst(dev);

memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN);
+
+   /*
+* unregister_netdev always frees the netdev, we use this mode
+* consistently to unify all the various unregister paths, including
+* those connected to rtnl_link_ops which require it.
+*/
+   dev->needs_free_netdev = true;
 }

 static void ipoib_build_priv(struct net_device *dev)
@@ -2116,9 +2123,7 @@ static struct net_device
rn->send = ipoib_send;
rn->attach_mcast = ipoib_mcast_attach;
rn->detach_mcast = ipoib_mcast_detach;
-   rn->free_rdma_netdev = free_netdev;
rn->hca = hca;
-
dev->netdev_ops = &ipoib_netdev_default_pf;

return dev;
@@ -2173,6 +2178,15 @@ struct ipoib_dev_priv *ipoib_intf_alloc(struct ib_device 
*hca, u8 port,

rn = netdev_priv(dev);
rn->clnt_priv = priv;
+
+   /*
+* Only the child register_netdev flows can handle priv_destructor
+* being set, so we force it to NULL here and handle manually until it
+* is safe to turn on.
+*/
+   priv->next_priv_destructor = dev->priv_destructor;
+   dev->priv_destructor = NULL;
+
ipoib_build_priv(dev);

return priv;
@@ -2181,6 +2195,27 @@ struct ipoib_dev_pri

[PATCH rdma-next 09/10] IB/ipoib: Maintain the child_intfs list from ndo_init/uninit

2018-07-29 Thread Leon Romanovsky
From: Jason Gunthorpe 

This fixes a bug in the netlink path, where vlan_rwsem was not held
around __ipoib_vlan_add, allowing the child_intfs list to be manipulated
unsafely.

In the process this greatly simplifies the vlan_rwsem write side locking
to only cover a single non-sleeping statement.

This also further increases the safety of the removal ordering, by
holding a reference on the parent netdev while the child is active. This
ensures most bugs become either an oops on a NULL priv or a deadlock on
the netdev refcount.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c| 16 
 drivers/infiniband/ulp/ipoib/ipoib_netlink.c | 14 --
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c| 12 
 3 files changed, 16 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index b2fe23d60103..e3d28f9ad9c0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1890,6 +1890,12 @@ static void ipoib_child_init(struct net_device *ndev)
struct ipoib_dev_priv *priv = ipoib_priv(ndev);
struct ipoib_dev_priv *ppriv = ipoib_priv(priv->parent);
 
+   dev_hold(priv->parent);
+
+   down_write(&ppriv->vlan_rwsem);
+   list_add_tail(&priv->list, &ppriv->child_intfs);
+   up_write(&ppriv->vlan_rwsem);
+
priv->max_ib_mtu = ppriv->max_ib_mtu;
set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags);
memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN);
@@ -1959,6 +1965,16 @@ static void ipoib_ndo_uninit(struct net_device *dev)
destroy_workqueue(priv->wq);
priv->wq = NULL;
}
+
+   if (priv->parent) {
+   struct ipoib_dev_priv *ppriv = ipoib_priv(priv->parent);
+
+   down_write(&ppriv->vlan_rwsem);
+   list_del(&priv->list);
+   up_write(&ppriv->vlan_rwsem);
+
+   dev_put(priv->parent);
+   }
 }
 
 static int ipoib_set_vf_link_state(struct net_device *dev, int vf, int 
link_state)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_netlink.c 
b/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
index a86928a80c08..7e093b7aad8f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_netlink.c
@@ -133,19 +133,6 @@ static int ipoib_new_child_link(struct net *src_net, 
struct net_device *dev,
return err;
 }
 
-static void ipoib_unregister_child_dev(struct net_device *dev, struct 
list_head *head)
-{
-   struct ipoib_dev_priv *priv, *ppriv;
-
-   priv = ipoib_priv(dev);
-   ppriv = ipoib_priv(priv->parent);
-
-   down_write(&ppriv->vlan_rwsem);
-   unregister_netdevice_queue(dev, head);
-   list_del(&priv->list);
-   up_write(&ppriv->vlan_rwsem);
-}
-
 static size_t ipoib_get_size(const struct net_device *dev)
 {
return nla_total_size(2) +  /* IFLA_IPOIB_PKEY   */
@@ -161,7 +148,6 @@ static struct rtnl_link_ops ipoib_link_ops __read_mostly = {
.setup  = ipoib_setup_common,
.newlink= ipoib_new_child_link,
.changelink = ipoib_changelink,
-   .dellink= ipoib_unregister_child_dev,
.get_size   = ipoib_get_size,
.fill_info  = ipoib_fill_info,
 };
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 
b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index fa4dfcee2644..ca3a7f6c0998 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -106,8 +106,6 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct 
ipoib_dev_priv *priv,
goto sysfs_failed;
}
 
-   list_add_tail(&priv->list, &ppriv->child_intfs);
-
return 0;
 
 sysfs_failed:
@@ -139,11 +137,6 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short 
pkey)
return -EPERM;
}
 
-   if (!down_write_trylock(&ppriv->vlan_rwsem)) {
-   rtnl_unlock();
-   return restart_syscall();
-   }
-
/*
 * First ensure this isn't a duplicate. We check the parent device and
 * then all of the legacy child interfaces to make sure the Pkey
@@ -175,7 +168,6 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short 
pkey)
free_netdev(ndev);
 
 out:
-   up_write(&ppriv->vlan_rwsem);
rtnl_unlock();
 
return result;
@@ -209,10 +201,6 @@ static void ipoib_vlan_delete_task(struct work_struct 
*work)
struct ipoib_dev_priv *priv = ipoib_priv(dev);
struct ipoib_dev_priv *ppriv = ipoib_priv(priv->parent);
 
-   down_write(&ppriv->vlan_rwsem);
-   list_del(&priv->list);
-   up_write(&ppriv->vlan_rwsem);
-
ipoib_dbg(ppriv, "delete child vlan %s\n", dev->name);
unregister_netdevice(dev

[PATCH rdma-next 00/10] IPoIB uninit

2018-07-29 Thread Leon Romanovsky
From: Leon Romanovsky 

IP link was broken due to the changes in IPoIB for the rdma_netdev
support after commit cd565b4b51e5 ("IB/IPoIB: Support acceleration options 
callbacks").

This patchset restores IPoIB pkey creation and removal using rtnetlink.

It is a completely rewritten variant of the
https://marc.info/?l=linux-rdma&m=151553425830918&w=2 patch series.

Thanks

Erez Shitrit (2):
  IB/ipoib: Use cancel_delayed_work_sync for neigh-clean task
  IB/ipoib: Make ipoib_neigh_hash_uninit fully fence its work

Jason Gunthorpe (8):
  IB/ipoib: Get rid of IPOIB_FLAG_GOING_DOWN
  IB/ipoib: Move all uninit code into ndo_uninit
  IB/ipoib: Move init code to ndo_init
  RDMA/netdev: Use priv_destructor for netdev cleanup
  IB/ipoib: Get rid of the sysfs_mutex
  IB/ipoib: Do not remove child devices from within the ndo_uninit
  IB/ipoib: Maintain the child_intfs list from ndo_init/uninit
  IB/ipoib: Consolidate checking of the proposed child interface

 drivers/infiniband/hw/mlx5/main.c  |  10 -
 drivers/infiniband/ulp/ipoib/ipoib.h   |  16 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c|  14 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  | 419 -
 drivers/infiniband/ulp/ipoib/ipoib_netlink.c   |  23 --
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c  | 259 +++--
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  37 +-
 include/linux/mlx5/driver.h|   3 -
 include/rdma/ib_verbs.h|   6 +-
 9 files changed, 428 insertions(+), 359 deletions(-)

--
2.14.4



[PATCH rdma-next 07/10] IB/ipoib: Get rid of the sysfs_mutex

2018-07-29 Thread Leon Romanovsky
From: Jason Gunthorpe 

This mutex was introduced to deal with the deadlock formed by calling
unregister_netdev from within the sysfs callback of a netdev.

Now that we have priv_destructor and needs_free_netdev we can switch
to the more targeted solution of running the unregister from a
work queue. This avoids the deadlock and gets rid of the mutex.

The next patch in the series needs this mutex eliminated to make
unregistration atomic.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib.h  |  1 -
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   |  7 ---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  7 +--
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 98 ---
 4 files changed, 65 insertions(+), 48 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index d2cb0a8500e3..804cb4bee57d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -332,7 +332,6 @@ struct ipoib_dev_priv {
 
struct rw_semaphore vlan_rwsem;
struct mutex mcast_mutex;
-   struct mutex sysfs_mutex;
 
struct rb_root  path_tree;
struct list_head path_list;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 83fa402e5d03..04785d3f8195 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -1514,19 +1514,13 @@ static ssize_t set_mode(struct device *d, struct 
device_attribute *attr,
 {
struct net_device *dev = to_net_dev(d);
int ret;
-   struct ipoib_dev_priv *priv = ipoib_priv(dev);
-
-   if (!mutex_trylock(&priv->sysfs_mutex))
-   return restart_syscall();
 
if (!rtnl_trylock()) {
-   mutex_unlock(&priv->sysfs_mutex);
return restart_syscall();
}
 
if (dev->reg_state != NETREG_REGISTERED) {
rtnl_unlock();
-   mutex_unlock(&priv->sysfs_mutex);
return -EPERM;
}
 
@@ -1538,7 +1532,6 @@ static ssize_t set_mode(struct device *d, struct 
device_attribute *attr,
 */
if (ret != -EBUSY)
rtnl_unlock();
-   mutex_unlock(&priv->sysfs_mutex);
 
return (!ret || ret == -EBUSY) ? count : ret;
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 73d917d57f93..e9f4f261fe20 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -2079,7 +2079,6 @@ static void ipoib_build_priv(struct net_device *dev)
spin_lock_init(&priv->lock);
init_rwsem(&priv->vlan_rwsem);
mutex_init(&priv->mcast_mutex);
-   mutex_init(&priv->sysfs_mutex);
 
INIT_LIST_HEAD(&priv->path_list);
INIT_LIST_HEAD(&priv->child_intfs);
@@ -2476,10 +2475,7 @@ static void ipoib_remove_one(struct ib_device *device, 
void *client_data)
list_for_each_entry_safe(priv, tmp, dev_list, list) {
ipoib_parent_unregister_pre(priv->dev);
 
-   /* Wrap rtnl_lock/unlock with mutex to protect sysfs calls */
-   mutex_lock(&priv->sysfs_mutex);
unregister_netdev(priv->dev);
-   mutex_unlock(&priv->sysfs_mutex);
}
 
kfree(dev_list);
@@ -2527,8 +2523,7 @@ static int __init ipoib_init_module(void)
 * its private workqueue, and we only queue up flush events
 * on our global flush workqueue.  This avoids the deadlocks.
 */
-   ipoib_workqueue = alloc_ordered_workqueue("ipoib_flush",
- WQ_MEM_RECLAIM);
+   ipoib_workqueue = alloc_ordered_workqueue("ipoib_flush", 0);
if (!ipoib_workqueue) {
ret = -ENOMEM;
goto err_fs;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 
b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index 7776334cf8c5..891c5b40018a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -125,23 +125,16 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned 
short pkey)
snprintf(intf_name, sizeof(intf_name), "%s.%04x",
 ppriv->dev->name, pkey);
 
-   if (!mutex_trylock(&ppriv->sysfs_mutex))
+   if (!rtnl_trylock())
return restart_syscall();
 
-   if (!rtnl_trylock()) {
-   mutex_unlock(&ppriv->sysfs_mutex);
-   return restart_syscall();
-   }
-
if (pdev->reg_state != NETREG_REGISTERED) {
rtnl_unlock();
-   mutex_unlock(&ppriv->sysfs_mutex);
return -EPERM;
}
 
if (!down_write_trylock(&ppriv->vlan_rwsem)) {
rtnl_unlock();
-   mutex_unlock(&ppriv->sysfs_mutex);
return restart_syscall();
}
 
@@ -178,58 +171,95 @@ int ipoib_vlan_a

[PATCH rdma-next 05/10] IB/ipoib: Move init code to ndo_init

2018-07-29 Thread Leon Romanovsky
From: Jason Gunthorpe 

Now that we have a proper ndo_uninit, move code that naturally pairs
with ndo_uninit into ndo_init. This allows the netdev core to naturally
handle the ordering.

This fixes the situation where register_netdev can fail before calling
ndo_init, in which case it wouldn't call ndo_uninit either.

Also move a bunch of duplicated init code that is shared between child
and parent for clarity. Now the child and parent register functions look
very similar.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib.h |   3 -
 drivers/infiniband/ulp/ipoib/ipoib_main.c| 193 +++
 drivers/infiniband/ulp/ipoib/ipoib_netlink.c |   6 -
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c|  31 +
 4 files changed, 114 insertions(+), 119 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 04fc5ad1b69f..02ad1a60dc80 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -508,8 +508,6 @@ void ipoib_ib_dev_down(struct net_device *dev);
 int ipoib_ib_dev_stop_default(struct net_device *dev);
 void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
-int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
-
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
@@ -597,7 +595,6 @@ void ipoib_pkey_open(struct ipoib_dev_priv *priv);
 void ipoib_drain_cq(struct net_device *dev);
 
 void ipoib_set_ethtool_ops(struct net_device *dev);
-void ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device 
*hca);
 
 #define IPOIB_FLAGS_RC 0x80
 #define IPOIB_FLAGS_UC 0x40
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index d4e9951dc539..67ab52eec3e9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1741,13 +1741,11 @@ static int ipoib_ioctl(struct net_device *dev, struct 
ifreq *ifr,
return priv->rn_ops->ndo_do_ioctl(dev, ifr, cmd);
 }
 
-int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
+static int ipoib_dev_init(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = ipoib_priv(dev);
int ret = -ENOMEM;
 
-   priv->ca = ca;
-   priv->port = port;
priv->qp = NULL;
 
/*
@@ -1763,7 +1761,7 @@ int ipoib_dev_init(struct net_device *dev, struct 
ib_device *ca, int port)
/* create pd, which used both for control and datapath*/
priv->pd = ib_alloc_pd(priv->ca, 0);
if (IS_ERR(priv->pd)) {
-   pr_warn("%s: failed to allocate PD\n", ca->name);
+   pr_warn("%s: failed to allocate PD\n", priv->ca->name);
goto clean_wq;
}
 
@@ -1837,6 +1835,108 @@ static void ipoib_parent_unregister_pre(struct 
net_device *ndev)
flush_workqueue(ipoib_workqueue);
 }
 
+static void ipoib_set_dev_features(struct ipoib_dev_priv *priv)
+{
+   priv->hca_caps = priv->ca->attrs.device_cap_flags;
+
+   if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) {
+   priv->dev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
+
+   if (priv->hca_caps & IB_DEVICE_UD_TSO)
+   priv->dev->hw_features |= NETIF_F_TSO;
+
+   priv->dev->features |= priv->dev->hw_features;
+   }
+}
+
+static int ipoib_parent_init(struct net_device *ndev)
+{
+   struct ipoib_dev_priv *priv = ipoib_priv(ndev);
+   struct ib_port_attr attr;
+   int result;
+
+   result = ib_query_port(priv->ca, priv->port, &attr);
+   if (result) {
+   pr_warn("%s: ib_query_port %d failed\n", priv->ca->name,
+   priv->port);
+   return result;
+   }
+   priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
+
+   result = ib_query_pkey(priv->ca, priv->port, 0, &priv->pkey);
+   if (result) {
+   pr_warn("%s: ib_query_pkey port %d failed (ret = %d)\n",
+   priv->ca->name, priv->port, result);
+   return result;
+   }
+
+   result = rdma_query_gid(priv->ca, priv->port, 0, &priv->local_gid);
+   if (result) {
+   pr_warn("%s: rdma_query_gid port %d failed (ret = %d)\n",
+   priv->ca->name, priv->port, result);
+   return result;
+   }
+   memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw,
+  sizeof(union ib_gid));
+
+   SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent);
+   priv->dev->dev_id = priv->port - 1;
+
+   return 0;
+}
+
+static void ipoib_child_init(struct net_device *ndev)
+{
+   struct ipoib_dev_priv *priv = ipoib_priv(ndev);
+   struct ipoib_dev_priv *ppriv = ipoib_priv(priv->parent);
+
+   priv->max_

[PATCH rdma-next 01/10] IB/ipoib: Get rid of IPOIB_FLAG_GOING_DOWN

2018-07-29 Thread Leon Romanovsky
From: Jason Gunthorpe 

This essentially duplicates the netdev's reg_state, so just use that
directly. The reg_state is updated under the rtnl_lock, and all places
using GOING_DOWN already acquire the rtnl_lock, so checking it is safe.

Since the only place we use GOING_DOWN is for the parent device, this
does not fix any bugs, but it is a step to tidy up the unregister flow
so that after later patches the flow is uniform and sane.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib.h  |  1 -
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   |  9 ++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  3 ---
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 18 --
 4 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index e255a7e5a4c3..9eebb705d994 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -95,7 +95,6 @@ enum {
IPOIB_NEIGH_TBL_FLUSH = 12,
IPOIB_FLAG_DEV_ADDR_SET   = 13,
IPOIB_FLAG_DEV_ADDR_CTRL  = 14,
-   IPOIB_FLAG_GOING_DOWN = 15,
 
IPOIB_MAX_BACKOFF_SECONDS = 16,
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 8b44f33c7ae0..83fa402e5d03 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -1516,9 +1516,6 @@ static ssize_t set_mode(struct device *d, struct 
device_attribute *attr,
int ret;
struct ipoib_dev_priv *priv = ipoib_priv(dev);
 
-   if (test_bit(IPOIB_FLAG_GOING_DOWN, &priv->flags))
-   return -EPERM;
-
if (!mutex_trylock(&priv->sysfs_mutex))
return restart_syscall();
 
@@ -1527,6 +1524,12 @@ static ssize_t set_mode(struct device *d, struct 
device_attribute *attr,
return restart_syscall();
}
 
+   if (dev->reg_state != NETREG_REGISTERED) {
+   rtnl_unlock();
+   mutex_unlock(&priv->sysfs_mutex);
+   return -EPERM;
+   }
+
ret = ipoib_set_mode(dev, buf);
 
/* The assumption is that the function ipoib_set_mode returned
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 82f0e3869b04..7ca9013bf05c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -2406,9 +2406,6 @@ static void ipoib_remove_one(struct ib_device *device, 
void *client_data)
ib_unregister_event_handler(&priv->event_handler);
flush_workqueue(ipoib_workqueue);
 
-   /* mark interface in the middle of destruction */
-   set_bit(IPOIB_FLAG_GOING_DOWN, &priv->flags);
-
rtnl_lock();
dev_change_flags(priv->dev, priv->dev->flags & ~IFF_UP);
rtnl_unlock();
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 
b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index b067ad5e4c7e..1b7bfd500893 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -127,9 +127,6 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short 
pkey)
 
ppriv = ipoib_priv(pdev);
 
-   if (test_bit(IPOIB_FLAG_GOING_DOWN, &ppriv->flags))
-   return -EPERM;
-
snprintf(intf_name, sizeof(intf_name), "%s.%04x",
 ppriv->dev->name, pkey);
 
@@ -141,6 +138,12 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short 
pkey)
return restart_syscall();
}
 
+   if (pdev->reg_state != NETREG_REGISTERED) {
+   rtnl_unlock();
+   mutex_unlock(&ppriv->sysfs_mutex);
+   return -EPERM;
+   }
+
if (!down_write_trylock(&ppriv->vlan_rwsem)) {
rtnl_unlock();
mutex_unlock(&ppriv->sysfs_mutex);
@@ -199,9 +202,6 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned 
short pkey)
 
ppriv = ipoib_priv(pdev);
 
-   if (test_bit(IPOIB_FLAG_GOING_DOWN, &ppriv->flags))
-   return -EPERM;
-
if (!mutex_trylock(&ppriv->sysfs_mutex))
return restart_syscall();
 
@@ -210,6 +210,12 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned 
short pkey)
return restart_syscall();
}
 
+   if (pdev->reg_state != NETREG_REGISTERED) {
+   rtnl_unlock();
+   mutex_unlock(&ppriv->sysfs_mutex);
+   return -EPERM;
+   }
+
if (!down_write_trylock(&ppriv->vlan_rwsem)) {
rtnl_unlock();
mutex_unlock(&ppriv->sysfs_mutex);
-- 
2.14.4



[PATCH rdma-next 03/10] IB/ipoib: Move all uninit code into ndo_uninit

2018-07-29 Thread Leon Romanovsky
From: Jason Gunthorpe 

Currently uninit is sometimes done twice in error flows, and is sprinkled
a bit all over the place.

Improve the clarity of the design by moving all uninit code into
ndo_uninit.

Some duplication is removed:
 - Sometimes IPOIB_STOP_NEIGH_GC was done before unregister, but
   this duplicates the process in ipoib_neigh_hash_init
 - Flushing priv->wq was sometimes done before unregister,
   but that duplicates what has been done in ndo_uninit

Uninitializing the IB event queue must remain before unregister_netdev as it
requires the RTNL lock to be dropped; this is moved to a helper to make
that flow really clear and to remove some duplication in error flows.

If register_netdev fails (and ndo_init is NULL) then it almost always
calls ndo_uninit, which lets us remove all the extra code from the error
unwinds. The next patch in the series will close the 'almost always' hole
by pairing a proper ndo_init with ndo_uninit.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib.h  |  1 -
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 62 ---
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c |  5 +--
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9eebb705d994..48e9ea50ca87 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -510,7 +510,6 @@ int ipoib_ib_dev_stop_default(struct net_device *dev);
 void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
-void ipoib_dev_cleanup(struct net_device *dev);
 
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 7cd42619deb2..4eec2781e83f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -215,11 +215,6 @@ static int ipoib_stop(struct net_device *dev)
return 0;
 }
 
-static void ipoib_uninit(struct net_device *dev)
-{
-   ipoib_dev_cleanup(dev);
-}
-
 static netdev_features_t ipoib_fix_features(struct net_device *dev, 
netdev_features_t features)
 {
struct ipoib_dev_priv *priv = ipoib_priv(dev);
@@ -1826,7 +1821,33 @@ int ipoib_dev_init(struct net_device *dev, struct 
ib_device *ca, int port)
return ret;
 }
 
-void ipoib_dev_cleanup(struct net_device *dev)
+/*
+ * This must be called before doing an unregister_netdev on a parent device to
+ * shutdown the IB event handler.
+ */
+static void ipoib_parent_unregister_pre(struct net_device *ndev)
+{
+   struct ipoib_dev_priv *priv = ipoib_priv(ndev);
+
+   /*
+* ipoib_set_mac checks netif_running before pushing work, clearing
+* running ensures it will not add more work.
+*/
+   rtnl_lock();
+   dev_change_flags(priv->dev, priv->dev->flags & ~IFF_UP);
+   rtnl_unlock();
+
+   /* ipoib_event() cannot be running once this returns */
+   ib_unregister_event_handler(&priv->event_handler);
+
+   /*
+* Work on the queue grabs the rtnl lock, so this cannot be done while
+* also holding it.
+*/
+   flush_workqueue(ipoib_workqueue);
+}
+
+static void ipoib_ndo_uninit(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = ipoib_priv(dev), *cpriv, *tcpriv;
LIST_HEAD(head);
@@ -1899,7 +1920,7 @@ static const struct header_ops ipoib_header_ops = {
 };
 
 static const struct net_device_ops ipoib_netdev_ops_pf = {
-   .ndo_uninit  = ipoib_uninit,
+   .ndo_uninit  = ipoib_ndo_uninit,
.ndo_open= ipoib_open,
.ndo_stop= ipoib_stop,
.ndo_change_mtu  = ipoib_change_mtu,
@@ -1918,7 +1939,7 @@ static const struct net_device_ops ipoib_netdev_ops_pf = {
 };
 
 static const struct net_device_ops ipoib_netdev_ops_vf = {
-   .ndo_uninit  = ipoib_uninit,
+   .ndo_uninit  = ipoib_ndo_uninit,
.ndo_open= ipoib_open,
.ndo_stop= ipoib_stop,
.ndo_change_mtu  = ipoib_change_mtu,
@@ -2321,7 +2342,8 @@ static struct net_device *ipoib_add_port(const char 
*format,
if (result) {
pr_warn("%s: couldn't register ipoib port %d; error %d\n",
hca->name, port, result);
-   goto register_failed;
+   ipoib_parent_unregister_pre(priv->dev);
+   goto device_init_failed;
}
 
result = -ENOMEM;
@@ -2339,17 +2361,9 @@ static struct net_device *ipoib_add_port(const char 
*format,
return priv->dev;
 
 sysfs_failed:
+   ipoib_parent_unregister_pre(priv->dev);
unregister_netdev(priv->dev);
 
-register_failed:
-   ib_u

[PATCH rdma-next 04/10] IB/ipoib: Make ipoib_neigh_hash_uninit fully fence its work

2018-07-29 Thread Leon Romanovsky
From: Erez Shitrit 

The neigh_reap_task is self-restarting, but so long as we call
cancel_delayed_work_sync() it is guaranteed not to be running and to
never start again. Thus there is no longer any purpose to
IPOIB_STOP_NEIGH_GC (which was racy/buggy anyhow), so remove it too.

This fixes a situation where the GC work could have been running

Signed-off-by: Erez Shitrit 
Signed-off-by: Jason Gunthorpe 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/ulp/ipoib/ipoib.h  |  1 -
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 25 +++--
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 48e9ea50ca87..04fc5ad1b69f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -91,7 +91,6 @@ enum {
IPOIB_STOP_REAPER = 7,
IPOIB_FLAG_ADMIN_CM   = 9,
IPOIB_FLAG_UMCAST = 10,
-   IPOIB_STOP_NEIGH_GC   = 11,
IPOIB_NEIGH_TBL_FLUSH = 12,
IPOIB_FLAG_DEV_ADDR_SET   = 13,
IPOIB_FLAG_DEV_ADDR_CTRL  = 14,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 4eec2781e83f..d4e9951dc539 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1305,9 +1305,6 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv 
*priv)
int i;
LIST_HEAD(remove_list);
 
-   if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
-   return;
-
spin_lock_irqsave(&priv->lock, flags);
 
htbl = rcu_dereference_protected(ntbl->htbl,
@@ -1319,9 +1316,6 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv 
*priv)
/* neigh is obsolete if it was idle for two GC periods */
dt = 2 * arp_tbl.gc_interval;
neigh_obsolete = jiffies - dt;
-   /* handle possible race condition */
-   if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
-   goto out_unlock;
 
for (i = 0; i < htbl->size; i++) {
struct ipoib_neigh *neigh;
@@ -1359,9 +1353,8 @@ static void ipoib_reap_neigh(struct work_struct *work)
 
__ipoib_reap_neigh(priv);
 
-   if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
-   queue_delayed_work(priv->wq, &priv->neigh_reap_task,
-  arp_tbl.gc_interval);
+   queue_delayed_work(priv->wq, &priv->neigh_reap_task,
+  arp_tbl.gc_interval);
 }
 
 
@@ -1523,7 +1516,6 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv 
*priv)
htbl = kzalloc(sizeof(*htbl), GFP_KERNEL);
if (!htbl)
return -ENOMEM;
-   set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
size = roundup_pow_of_two(arp_tbl.gc_thresh3);
buckets = kvcalloc(size, sizeof(*buckets), GFP_KERNEL);
if (!buckets) {
@@ -1538,7 +1530,6 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv 
*priv)
atomic_set(&ntbl->entries, 0);
 
/* start garbage collection */
-   clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
queue_delayed_work(priv->wq, &priv->neigh_reap_task,
   arp_tbl.gc_interval);
 
@@ -1648,15 +1639,11 @@ static void ipoib_flush_neighs(struct ipoib_dev_priv 
*priv)
 static void ipoib_neigh_hash_uninit(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = ipoib_priv(dev);
-   int stopped;
 
ipoib_dbg(priv, "ipoib_neigh_hash_uninit\n");
init_completion(&priv->ntbl.deleted);
 
-   /* Stop GC if called at init fail need to cancel work */
-   stopped = test_and_set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
-   if (!stopped)
-   cancel_delayed_work_sync(&priv->neigh_reap_task);
+   cancel_delayed_work_sync(&priv->neigh_reap_task);
 
ipoib_flush_neighs(priv);
 
@@ -1796,12 +1783,15 @@ int ipoib_dev_init(struct net_device *dev, struct 
ib_device *ca, int port)
if (ipoib_ib_dev_open(dev)) {
pr_warn("%s failed to open device\n", dev->name);
ret = -ENODEV;
-   goto out_dev_uninit;
+   goto out_hash_uninit;
}
}
 
return 0;
 
+out_hash_uninit:
+   ipoib_neigh_hash_uninit(dev);
+
 out_dev_uninit:
ipoib_ib_dev_cleanup(dev);
 
@@ -1857,7 +1847,6 @@ static void ipoib_ndo_uninit(struct net_device *dev)
/* Delete any child interfaces first */
list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) {
/* Stop GC on child */
-   set_bit(IPOIB_STOP_NEIGH_GC, &cpriv->flags);
cancel_delayed_work_sync(&cpriv->neigh_reap_task);
unregister_netdevice_queue(cpriv->dev, &head);
}
-- 
2.14.4



Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-29 Thread Moshe Shemesh
On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas  wrote:

> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  wrote:
> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com
> wrote:
> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
> > >>>  The devlink params haven't been upstream even for a full cycle
> and
> > >>>  already you guys are starting to use them to configure standard
> > >>>  features like queuing.
> > >>> >>>
> > >>> >>> We developed the devlink params in order to support non-standard
> > >>> >>> configuration only. And for non-standard, there are generic and
> vendor
> > >>> >>> specific options.
> > >>> >>
> > >>> >> I thought it was developed for performing non-standard and
> possibly
> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
> for
> > >>> >> examples of well justified generic options for which we have no
> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
> if you
> > >>> >> ask me, too.
> > >>> >>
> > >>> >> Configuring queuing has an API.  The question is it acceptable to
> enter
> > >>> >> into the risky territory of controlling offloads via devlink
> parameters
> > >>> >> or would we rather make vendors take the time and effort to model
> > >>> >> things to (a subset) of existing APIs.  The HW never fits the APIs
> > >>> >> perfectly.
> > >>> >
> > >>> > I understand what you meant here, I would like to highlight that
> this
> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
> > >>> > The vendor specific configuration suggested here is to handle a
> congestion
> > >>> > state in Multi Host environment (which includes PF and multiple
> VFs per
> > >>> > host), where one host is not aware to the other hosts, and each is
> running
> > >>> > on its own pci/driver. It is a device working mode configuration.
> > >>> >
> > >>> > This  couldn't fit into any existing API, thus creating this
> vendor specific
> > >>> > unique API is needed.
> > >>>
> > >>> If we are just going to start creating devlink interfaces in for
> every
> > >>> one-off option a device wants to add why did we even bother with
> > >>> trying to prevent drivers from using sysfs? This just feels like we
> > >>> are back to the same arguments we had back in the day with it.
> > >>>
> > >>> I feel like the bigger question here is if devlink is how we are
> going
> > >>> to deal with all PCIe related features going forward, or should we
> > >>> start looking at creating a new interface/tool for PCI/PCIe related
> > >>> features? My concern is that we have already had features such as DMA
> > >>> Coalescing that didn't really fit into anything and now we are
> > >>> starting to see other things related to DMA and PCIe bus credits. I'm
> > >>> wondering if we shouldn't start looking at a tool/interface to
> > >>> configure all the PCIe related features such as interrupts, error
> > >>> reporting, DMA configuration, power management, etc. Maybe we could
> > >>> even look at sharing it across subsystems and include things like
> > >>> storage, graphics, and other subsystems in the conversation.
> > >>
> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do need
> > >>to build up an API.  Sharing it across subsystems would be very cool!
>
> I read the thread (starting at [1], for anybody else coming in late)
> and I see this has something to do with "configuring outbound PCIe
> buffers", but I haven't seen the connection to PCIe protocol or
> features, i.e., I can't connect this to anything in the PCIe spec.
>
> Can somebody help me understand how the PCI core is relevant?  If
> there's some connection with a feature defined by PCIe, or if it
> affects the PCIe transaction protocol somehow, I'm definitely
> interested in this.  But if this only affects the data transferred
> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
> sure why the PCI core should care.
>
>

As you wrote, this is not a PCIe feature, nor does it affect the PCIe
transaction protocol.

Actually, due to a hardware limitation in the current device, we have
enabled a workaround in hardware.

This mode is proprietary and not relevant to other PCIe devices, and is
therefore set using a driver-specific devlink parameter.

> > > I wonder how come there isn't such an API in place already. Or is it?
> > > If it is not, do you have any idea how should it look like? Should it
> be
> > > an extension of the existing PCI uapi or something completely new?
> > > It would be probably good to loop some PCI people in...
> >
> > The closest thing I can think of in terms of answering your questions
> > as to why we haven't seen anything like that would be setpci.
> > Basically with that tool you can go through the PCI configuration
> > space and update any piece you want. The pr


[PATCH rdma-next 00/27] Flow actions to mutate packets

2018-07-29 Thread Leon Romanovsky
From: Leon Romanovsky 

Hi,

This is PATCH variant of RFC posted in previous week to the ML.
https://patchwork.ozlabs.org/cover/944184/

Changelog:
 RFC -> v0:
  * Patch 1: A new patch which refactors the logic
when getting a flow namespace.
  * Patch 2 was split into two.
  * Patch 3: Fixed a typo in commit message
  * Patch 5: Updated commit message
  * Patch 7: Updated commit message
Renamed:
  - MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT_ID to
MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT
  - packet_reformat_id to reformat_id in struct mlx5_flow_act
  - packet_reformat_id to encap_id in struct mlx5_esw_flow_attr
  - packet_reformat_id to encap_id in struct mlx5e_encap_entry
  - PACKET_REFORMAT to REFORMAT when printing trace points
  * Patch 9: Updated commit message
Updated function declaration in mlx5_core.h, which could have
led to a compile error during bisection.
  * Patch 11: Disallow egress rules insertion when in switchdev mode
  * Patch 12: A new patch to deal with passing enum values using
the IOCTL infrastructure.
  * Patch 13: Use new enum value attribute when passing enum
mlx5_ib_uapi_flow_table_type
  * Patch 15: Don't set encap flags on flow tables if in switchdev mode
  * Patch 17: Use new enum value attribute when passing enum
mlx5_ib_uapi_flow_table_type and enum
mlx5_ib_uapi_flow_action_packet_reformat_type
  * Patch 19: Allow creation of both
MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TO_L3_TUNNEL
and MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L3_TUNNEL_TO_L2 packet
reformat actions.
  * Patch 20: A new patch which allows attaching packet reformat
actions to flow tables on NIC RX.

Thanks


From Mark:
This series exposes the ability to create flow actions which can
mutate packet headers. We do that by exposing two new verbs:
 * modify header - can change existing packet headers. packet
 * reformat - can encapsulate or decapsulate a packet.
  Once created a flow action must be attached to a steering
  rule for it to take effect.

Thanks

Guy Levi (1):
  IB/uverbs: Add IDRs array attribute type to ioctl() interface

Mark Bloch (26):
  net/mlx5: Cleanup flow namespace getter switch logic
  net/mlx5: Add proper NIC TX steering flow tables support
  net/mlx5: Export modify header alloc/dealloc functions
  net/mlx5: Add support for more namespaces when allocating modify
header
  net/mlx5: Break encap/decap into two separated flow table creation
flags
  net/mlx5: Move header encap type to IFC header file
  {net, RDMA}/mlx5: Rename encap to reformat packet
  net/mlx5: Expose new packet reformat capabilities
  net/mlx5: Pass a namespace for packet reformat ID allocation
  net/mlx5: Export packet reformat alloc/dealloc functions
  RDMA/mlx5: Add NIC TX steering support
  RDMA/uverbs: Add UVERBS_ATTR_CONST_IN to the specs language
  RDMA/mlx5: Add a new flow action verb, modify header
  RDMA/mlx5: Enable attaching modify header to steering flows
  RDMA/mlx5: Enable decap and packet reformat on flow tables
  RDMA/uverbs: Add generic function to fill in flow action object
  RDMA/mlx5: Add new flow action verb, packet reformat
  RDMA/mlx5: Enable attaching DECAP action to steering flows
  RDMA/mlx5: Extend packet reformat verbs
  RDMA/mlx5: Enable reformat on NIC RX if supported
  RDMA/mlx5: Enable attaching packet reformat action to steering flows
  RDMA/mlx5: Refactor flow action parsing to be more generic
  RDMA/mlx5: Refactor DEVX flow creation
  RDMA/mlx5: Add flow actions support to DEVX create flow
  RDMA/mlx5: Add NIC TX namespace when getting a flow table
  RDMA/mlx5: Allow creating a matcher for a NIC TX flow table

 drivers/infiniband/core/uverbs_ioctl.c | 115 ++-
 .../infiniband/core/uverbs_std_types_flow_action.c |   7 +-
 drivers/infiniband/hw/mlx5/devx.c  |   6 +-
 drivers/infiniband/hw/mlx5/flow.c  | 351 -
 drivers/infiniband/hw/mlx5/main.c  | 140 +---
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |  26 +-
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c  |   8 +-
 .../mellanox/mlx5/core/diag/fs_tracepoint.h|   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|  50 +--
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   2 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   |  87 +++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  57 ++--
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|  11 -
 include/linux/mlx5/device.h|   6 +
 include/linux/mlx5/fs.h|  20 +-
 include/linux/mlx5/mlx5_ifc.h  |  70 ++--
 include/rdma/uverbs_ioctl.h|  98 +-
 include/rdma/uverbs_std_types.h|  12 +
 include/uapi/rdma/mlx5_user_i

[PATCH mlx5-next 02/27] net/mlx5: Add proper NIC TX steering flow tables support

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Expose the ability to add steering rules to NIC TX flow tables.
For now, we are only adding TX bypass (egress) which is used by the RDMA
side. This will allow an administrator to control outgoing traffic and
tweak it if needed, for example performing encapsulation or rewriting
headers.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 33 +--
 include/linux/mlx5/device.h   |  6 +
 3 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 8e01f818021b..28c7301e08f4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -760,8 +760,8 @@ const struct mlx5_flow_cmds *mlx5_fs_cmd_get_default(enum 
fs_flow_table_type typ
case FS_FT_FDB:
case FS_FT_SNIFFER_RX:
case FS_FT_SNIFFER_TX:
-   return mlx5_fs_cmd_get_fw_cmds();
case FS_FT_NIC_TX:
+   return mlx5_fs_cmd_get_fw_cmds();
default:
return mlx5_fs_cmd_get_stub_cmds();
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 17bbad8ee882..8243a93e1d6c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -150,6 +150,17 @@ static struct init_tree_node {
}
 };
 
+static struct init_tree_node egress_root_fs = {
+   .type = FS_TYPE_NAMESPACE,
+   .ar_size = 1,
+   .children = (struct init_tree_node[]) {
+   ADD_PRIO(0, MLX5_BY_PASS_NUM_PRIOS, 0,
+FS_CHAINING_CAPS,
+ADD_NS(ADD_MULTIPLE_PRIO(MLX5_BY_PASS_NUM_PRIOS,
+ BY_PASS_PRIO_NUM_LEVELS))),
+   }
+};
+
 enum fs_i_lock_class {
FS_LOCK_GRANDPARENT,
FS_LOCK_PARENT,
@@ -2008,8 +2019,10 @@ struct mlx5_flow_namespace 
*mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
return &steering->sniffer_tx_root_ns->ns;
break;
case MLX5_FLOW_NAMESPACE_EGRESS:
-   if (steering->egress_root_ns)
-   return &steering->egress_root_ns->ns;
+   if (steering->egress_root_ns) {
+   steering_ns = steering->egress_root_ns;
+   prio = 0;
+   }
break;
default:
break;
@@ -2530,16 +2543,20 @@ static int init_ingress_acls_root_ns(struct 
mlx5_core_dev *dev)
 
 static int init_egress_root_ns(struct mlx5_flow_steering *steering)
 {
-   struct fs_prio *prio;
-
steering->egress_root_ns = create_root_ns(steering,
  FS_FT_NIC_TX);
if (!steering->egress_root_ns)
return -ENOMEM;
 
-   /* create 1 prio*/
-   prio = fs_create_prio(&steering->egress_root_ns->ns, 0, 1);
-   return PTR_ERR_OR_ZERO(prio);
+   if (init_root_tree(steering, &egress_root_fs,
+  &steering->egress_root_ns->ns.node))
+   goto cleanup;
+   set_prio_attrs(steering->egress_root_ns);
+   return 0;
+cleanup:
+   cleanup_root_ns(steering->egress_root_ns);
+   steering->egress_root_ns = NULL;
+   return -ENOMEM;
 }
 
 int mlx5_init_fs(struct mlx5_core_dev *dev)
@@ -2607,7 +2624,7 @@ int mlx5_init_fs(struct mlx5_core_dev *dev)
goto err;
}
 
-   if (MLX5_IPSEC_DEV(dev)) {
+   if (MLX5_IPSEC_DEV(dev) || MLX5_CAP_FLOWTABLE_NIC_TX(dev, ft_support)) {
err = init_egress_root_ns(steering);
if (err)
goto err;
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 0566c6a94805..e9c35eb1cc26 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -1113,6 +1113,12 @@ enum mlx5_qcam_feature_groups {
 #define MLX5_CAP_FLOWTABLE_NIC_RX_MAX(mdev, cap) \
MLX5_CAP_FLOWTABLE_MAX(mdev, flow_table_properties_nic_receive.cap)
 
+#define MLX5_CAP_FLOWTABLE_NIC_TX(mdev, cap) \
+   MLX5_CAP_FLOWTABLE(mdev, flow_table_properties_nic_transmit.cap)
+
+#define MLX5_CAP_FLOWTABLE_NIC_TX_MAX(mdev, cap) \
+   MLX5_CAP_FLOWTABLE_MAX(mdev, 
flow_table_properties_nic_transmit.cap)
+
 #define MLX5_CAP_FLOWTABLE_SNIFFER_RX(mdev, cap) \
MLX5_CAP_FLOWTABLE(mdev, flow_table_properties_nic_receive_sniffer.cap)
 
-- 
2.14.4



[PATCH mlx5-next 01/27] net/mlx5: Cleanup flow namespace getter switch logic

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Refactor the switch logic so it's simpler to follow and understand.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 22 +-
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 0d8378243903..17bbad8ee882 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1976,7 +1976,7 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
						  enum mlx5_flow_namespace_type type)
 {
struct mlx5_flow_steering *steering = dev->priv.steering;
-   struct mlx5_flow_root_namespace *root_ns;
+   struct mlx5_flow_root_namespace *steering_ns = NULL;
int prio;
struct fs_prio *fs_prio;
struct mlx5_flow_namespace *ns;
@@ -1992,37 +1992,33 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
case MLX5_FLOW_NAMESPACE_KERNEL:
case MLX5_FLOW_NAMESPACE_LEFTOVERS:
case MLX5_FLOW_NAMESPACE_ANCHOR:
+   steering_ns = steering->root_ns;
prio = type;
break;
case MLX5_FLOW_NAMESPACE_FDB:
if (steering->fdb_root_ns)
return &steering->fdb_root_ns->ns;
-   else
-   return NULL;
+   break;
case MLX5_FLOW_NAMESPACE_SNIFFER_RX:
if (steering->sniffer_rx_root_ns)
return &steering->sniffer_rx_root_ns->ns;
-   else
-   return NULL;
+   break;
case MLX5_FLOW_NAMESPACE_SNIFFER_TX:
if (steering->sniffer_tx_root_ns)
return &steering->sniffer_tx_root_ns->ns;
-   else
-   return NULL;
+   break;
case MLX5_FLOW_NAMESPACE_EGRESS:
if (steering->egress_root_ns)
return &steering->egress_root_ns->ns;
-   else
-   return NULL;
+   break;
default:
-   return NULL;
+   break;
}
 
-   root_ns = steering->root_ns;
-   if (!root_ns)
+   if (!steering_ns)
return NULL;
 
-   fs_prio = find_prio(&root_ns->ns, prio);
+   fs_prio = find_prio(&steering_ns->ns, prio);
if (!fs_prio)
return NULL;
 
-- 
2.14.4



[PATCH mlx5-next 03/27] net/mlx5: Export modify header alloc/dealloc functions

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Those functions will be used by the RDMA side to create modify header
actions to be attached to flow steering rules via verbs.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c| 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h | 5 -
 include/linux/mlx5/fs.h | 6 ++
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 28c7301e08f4..37bea30b68ac 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -702,6 +702,7 @@ int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
kfree(in);
return err;
 }
+EXPORT_SYMBOL(mlx5_modify_header_alloc);
 
 void mlx5_modify_header_dealloc(struct mlx5_core_dev *dev, u32 modify_header_id)
 {
@@ -716,6 +717,7 @@ void mlx5_modify_header_dealloc(struct mlx5_core_dev *dev, u32 modify_header_id)
 
mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
+EXPORT_SYMBOL(mlx5_modify_header_dealloc);
 
 static const struct mlx5_flow_cmds mlx5_flow_cmds = {
.create_flow_table = mlx5_cmd_create_flow_table,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 49955117ae36..b076ce14c48c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -177,11 +177,6 @@ int mlx5_encap_alloc(struct mlx5_core_dev *dev,
 u32 *encap_id);
 void mlx5_encap_dealloc(struct mlx5_core_dev *dev, u32 encap_id);
 
-int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
-u8 namespace, u8 num_actions,
-void *modify_actions, u32 *modify_header_id);
-void mlx5_modify_header_dealloc(struct mlx5_core_dev *dev, u32 modify_header_id);
-
 bool mlx5_lag_intf_add(struct mlx5_interface *intf, struct mlx5_priv *priv);
 
 int mlx5_query_mtpps(struct mlx5_core_dev *dev, u32 *mtpps, u32 mtpps_size);
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index af0592400499..245f9e80ef92 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -196,4 +196,10 @@ int mlx5_fc_query(struct mlx5_core_dev *dev, struct mlx5_fc *counter,
 int mlx5_fs_add_rx_underlay_qpn(struct mlx5_core_dev *dev, u32 underlay_qpn);
 int mlx5_fs_remove_rx_underlay_qpn(struct mlx5_core_dev *dev, u32 underlay_qpn);
 
+int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
+u8 namespace, u8 num_actions,
+void *modify_actions, u32 *modify_header_id);
+void mlx5_modify_header_dealloc(struct mlx5_core_dev *dev,
+   u32 modify_header_id);
+
 #endif
-- 
2.14.4



[PATCH mlx5-next 05/27] net/mlx5: Break encap/decap into two separated flow table creation flags

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Today we are able to attach encap and decap actions only to the FDB. In
preparation to enable those actions on the NIC flow tables, break the
single flag into two. Those flags control whether decap or encap
operations can be attached to the flow table being created. For the FDB,
if encapsulation is required, we set both of them.

Signed-off-by: Mark Bloch 
Reviewed-by: Or Gerlitz 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   | 7 ---
 include/linux/mlx5/fs.h| 3 ++-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 8f50ce80ff66..83341c92602e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -524,7 +524,8 @@ static int esw_create_offloads_fast_fdb_table(struct mlx5_eswitch *esw)
esw_size >>= 1;
 
if (esw->offloads.encap != DEVLINK_ESWITCH_ENCAP_MODE_NONE)
-   flags |= MLX5_FLOW_TABLE_TUNNEL_EN;
+   flags |= (MLX5_FLOW_TABLE_TUNNEL_EN_ENCAP |
+ MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
 
fdb = mlx5_create_auto_grouped_flow_table(root_ns, FDB_FAST_PATH,
  esw_size,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 9ae777e56529..1698f325a21e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -152,7 +152,8 @@ static int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev,
  struct mlx5_flow_table *next_ft,
  unsigned int *table_id, u32 flags)
 {
-   int en_encap_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN);
+   int en_encap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_ENCAP);
+   int en_decap = !!(flags & MLX5_FLOW_TABLE_TUNNEL_EN_DECAP);
u32 out[MLX5_ST_SZ_DW(create_flow_table_out)] = {0};
u32 in[MLX5_ST_SZ_DW(create_flow_table_in)]   = {0};
int err;
@@ -169,9 +170,9 @@ static int mlx5_cmd_create_flow_table(struct mlx5_core_dev *dev,
}
 
MLX5_SET(create_flow_table_in, in, flow_table_context.decap_en,
-en_encap_decap);
+en_decap);
MLX5_SET(create_flow_table_in, in, flow_table_context.encap_en,
-en_encap_decap);
+en_encap);
 
switch (op_mod) {
case FS_FT_OP_MOD_NORMAL:
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 245f9e80ef92..816cbfa00c3b 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -45,7 +45,8 @@ enum {
 };
 
 enum {
-   MLX5_FLOW_TABLE_TUNNEL_EN = BIT(0),
+   MLX5_FLOW_TABLE_TUNNEL_EN_ENCAP = BIT(0),
+   MLX5_FLOW_TABLE_TUNNEL_EN_DECAP = BIT(1),
 };
 
 #define LEFTOVERS_RULE_NUM  2
-- 
2.14.4



[PATCH mlx5-next 09/27] net/mlx5: Pass a namespace for packet reformat ID allocation

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Currently we attach packet reformat actions only to the FDB namespace.
In preparation for using them for NIC steering as well, pass the actual
namespace as a parameter.

Signed-off-by: Mark Bloch 
Reviewed-by: Or Gerlitz 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c| 8 +++-
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h | 1 +
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index ec787e8a0be4..f111bf08f2be 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -925,6 +925,7 @@ void mlx5e_tc_encap_flows_add(struct mlx5e_priv *priv,
 
err = mlx5_packet_reformat_alloc(priv->mdev, e->tunnel_type,
 e->encap_size, e->encap_header,
+MLX5_FLOW_NAMESPACE_FDB,
 &e->encap_id);
if (err) {
 		mlx5_core_warn(priv->mdev, "Failed to offload cached encapsulation header, %d\n",
@@ -2314,6 +2315,7 @@ static int mlx5e_create_encap_header_ipv4(struct mlx5e_priv *priv,
 
err = mlx5_packet_reformat_alloc(priv->mdev, e->tunnel_type,
 ipv4_encap_size, encap_header,
+MLX5_FLOW_NAMESPACE_FDB,
 &e->encap_id);
if (err)
goto destroy_neigh_entry;
@@ -2422,6 +2424,7 @@ static int mlx5e_create_encap_header_ipv6(struct mlx5e_priv *priv,
 
err = mlx5_packet_reformat_alloc(priv->mdev, e->tunnel_type,
 ipv6_encap_size, encap_header,
+MLX5_FLOW_NAMESPACE_FDB,
 &e->encap_id);
if (err)
goto destroy_neigh_entry;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 4539b709db20..d686668a8d52 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -600,16 +600,22 @@ int mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
   int reformat_type,
   size_t size,
   void *reformat_data,
+  int namespace,
   u32 *packet_reformat_id)
 {
-   int max_encap_size = MLX5_CAP_ESW(dev, max_encap_header_size);
u32 out[MLX5_ST_SZ_DW(alloc_packet_reformat_context_out)];
void *packet_reformat_context_in;
+   int max_encap_size;
void *reformat;
int inlen;
int err;
u32 *in;
 
+   if (namespace == MLX5_FLOW_NAMESPACE_FDB)
+   max_encap_size = MLX5_CAP_ESW(dev, max_encap_header_size);
+   else
+   max_encap_size = MLX5_CAP_FLOWTABLE(dev, max_encap_header_size);
+
if (size > max_encap_size) {
 		mlx5_core_warn(dev, "encap size %zd too big, max supported is %d\n",
   size, max_encap_size);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index e16f6e6e03e1..0f3d9942d1a9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -174,6 +174,7 @@ int mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
   int reformat_type,
   size_t size,
   void *reformat_data,
+  int namespace,
   u32 *packet_reformat_id);
 void mlx5_packet_reformat_dealloc(struct mlx5_core_dev *dev,
  u32 packet_reformat_id);
-- 
2.14.4



[PATCH mlx5-next 04/27] net/mlx5: Add support for more namespaces when allocating modify header

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

There are RX and TX flow steering namespaces with different numbers of
actions. Initialize them accordingly.

Signed-off-by: Mark Bloch 
Reviewed-by: Or Gerlitz 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index 37bea30b68ac..9ae777e56529 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -667,9 +667,14 @@ int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
table_type = FS_FT_FDB;
break;
case MLX5_FLOW_NAMESPACE_KERNEL:
+   case MLX5_FLOW_NAMESPACE_BYPASS:
 		max_actions = MLX5_CAP_FLOWTABLE_NIC_RX(dev, max_modify_header_actions);
table_type = FS_FT_NIC_RX;
break;
+   case MLX5_FLOW_NAMESPACE_EGRESS:
+		max_actions = MLX5_CAP_FLOWTABLE_NIC_TX(dev, max_modify_header_actions);
+   table_type = FS_FT_NIC_TX;
+   break;
default:
return -EOPNOTSUPP;
}
-- 
2.14.4



[PATCH mlx5-next 06/27] net/mlx5: Move header encap type to IFC header file

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Those bits are hardware specification and should be defined at the
IFC header file.

Signed-off-by: Mark Bloch 
Reviewed-by: Or Gerlitz 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 5 -
 include/linux/mlx5/mlx5_ifc.h   | 5 +
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 0edf4751a8ba..74601f9d1c28 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -100,11 +100,6 @@ struct mlx5e_tc_flow_parse_attr {
int mirred_ifindex;
 };
 
-enum {
-   MLX5_HEADER_TYPE_VXLAN = 0x0,
-   MLX5_HEADER_TYPE_NVGRE = 0x1,
-};
-
 #define MLX5E_TC_TABLE_NUM_GROUPS 4
 #define MLX5E_TC_TABLE_MAX_GROUP_SIZE BIT(16)
 
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 5e04e2053fd7..1745366ee5b7 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -4843,6 +4843,11 @@ struct mlx5_ifc_alloc_encap_header_out_bits {
u8 reserved_at_60[0x20];
 };
 
+enum {
+   MLX5_HEADER_TYPE_VXLAN = 0x0,
+   MLX5_HEADER_TYPE_NVGRE = 0x1,
+};
+
 struct mlx5_ifc_alloc_encap_header_in_bits {
u8 opcode[0x10];
u8 reserved_at_10[0x10];
-- 
2.14.4



[PATCH mlx5-next 07/27] {net, RDMA}/mlx5: Rename encap to reformat packet

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Renames all encap mlx5_{core,ib} code to use the new naming of packet
reformat. No functional change is introduced. This is done because the
original naming didn't correctly reflect the operation performed by
this action: not only can we encapsulate a packet, we can also
decapsulate it.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/devx.c  |  6 +--
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c  |  8 +--
 .../mellanox/mlx5/core/diag/fs_tracepoint.h|  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 42 ---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  2 +-
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |  8 +--
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   | 63 --
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h| 13 ++---
 include/linux/mlx5/fs.h|  4 +-
 include/linux/mlx5/mlx5_ifc.h  | 50 -
 11 files changed, 106 insertions(+), 94 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c
index c9a7a12a8c13..3da36fa7024e 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -284,7 +284,7 @@ static bool devx_is_obj_create_cmd(const void *in)
case MLX5_CMD_OP_CREATE_FLOW_TABLE:
case MLX5_CMD_OP_CREATE_FLOW_GROUP:
case MLX5_CMD_OP_ALLOC_FLOW_COUNTER:
-   case MLX5_CMD_OP_ALLOC_ENCAP_HEADER:
+   case MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT:
case MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT:
case MLX5_CMD_OP_CREATE_SCHEDULING_ELEMENT:
case MLX5_CMD_OP_ADD_VXLAN_UDP_DPORT:
@@ -627,9 +627,9 @@ static void devx_obj_build_destroy_cmd(void *in, void *out, void *din,
MLX5_SET(general_obj_in_cmd_hdr, din, opcode,
 MLX5_CMD_OP_DEALLOC_FLOW_COUNTER);
break;
-   case MLX5_CMD_OP_ALLOC_ENCAP_HEADER:
+   case MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT:
MLX5_SET(general_obj_in_cmd_hdr, din, opcode,
-MLX5_CMD_OP_DEALLOC_ENCAP_HEADER);
+MLX5_CMD_OP_DEALLOC_PACKET_REFORMAT_CONTEXT);
break;
case MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT:
MLX5_SET(general_obj_in_cmd_hdr, din, opcode,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 381dbfa6a68e..694ed2afa7ed 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -308,7 +308,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
case MLX5_CMD_OP_MODIFY_FLOW_TABLE:
case MLX5_CMD_OP_SET_FLOW_TABLE_ENTRY:
case MLX5_CMD_OP_SET_FLOW_TABLE_ROOT:
-   case MLX5_CMD_OP_DEALLOC_ENCAP_HEADER:
+   case MLX5_CMD_OP_DEALLOC_PACKET_REFORMAT_CONTEXT:
case MLX5_CMD_OP_DEALLOC_MODIFY_HEADER_CONTEXT:
case MLX5_CMD_OP_FPGA_DESTROY_QP:
case MLX5_CMD_OP_DESTROY_GENERAL_OBJECT:
@@ -426,7 +426,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
case MLX5_CMD_OP_QUERY_FLOW_TABLE_ENTRY:
case MLX5_CMD_OP_ALLOC_FLOW_COUNTER:
case MLX5_CMD_OP_QUERY_FLOW_COUNTER:
-   case MLX5_CMD_OP_ALLOC_ENCAP_HEADER:
+   case MLX5_CMD_OP_ALLOC_PACKET_REFORMAT_CONTEXT:
case MLX5_CMD_OP_ALLOC_MODIFY_HEADER_CONTEXT:
case MLX5_CMD_OP_FPGA_CREATE_QP:
case MLX5_CMD_OP_FPGA_MODIFY_QP:
@@ -599,8 +599,8 @@ const char *mlx5_command_str(int command)
MLX5_COMMAND_STR_CASE(DEALLOC_FLOW_COUNTER);
MLX5_COMMAND_STR_CASE(QUERY_FLOW_COUNTER);
MLX5_COMMAND_STR_CASE(MODIFY_FLOW_TABLE);
-   MLX5_COMMAND_STR_CASE(ALLOC_ENCAP_HEADER);
-   MLX5_COMMAND_STR_CASE(DEALLOC_ENCAP_HEADER);
+   MLX5_COMMAND_STR_CASE(ALLOC_PACKET_REFORMAT_CONTEXT);
+   MLX5_COMMAND_STR_CASE(DEALLOC_PACKET_REFORMAT_CONTEXT);
MLX5_COMMAND_STR_CASE(ALLOC_MODIFY_HEADER_CONTEXT);
MLX5_COMMAND_STR_CASE(DEALLOC_MODIFY_HEADER_CONTEXT);
MLX5_COMMAND_STR_CASE(FPGA_CREATE_QP);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
index 0240aee9189e..e83dda441a81 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
@@ -133,7 +133,7 @@ TRACE_EVENT(mlx5_fs_del_fg,
{MLX5_FLOW_CONTEXT_ACTION_DROP,  "DROP"},\
{MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,  "FWD"},\
{MLX5_FLOW_CONTEXT_ACTION_COUNT, "CNT"},\
-   {MLX5_FLOW_CONTEXT_ACTION_ENCAP, "ENCAP"},\
+   {MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT, "REFORMAT"},\
{MLX5_FLOW_CONTEXT_ACTION_DE

[PATCH mlx5-next 10/27] net/mlx5: Export packet reformat alloc/dealloc functions

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

This will allow the RDMA side to allocate packet reformat contexts.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c| 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h | 8 
 include/linux/mlx5/fs.h | 9 +
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index d686668a8d52..eb91cbf42ce7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -651,6 +651,7 @@ int mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
kfree(in);
return err;
 }
+EXPORT_SYMBOL(mlx5_packet_reformat_alloc);
 
 void mlx5_packet_reformat_dealloc(struct mlx5_core_dev *dev,
  u32 packet_reformat_id)
@@ -666,6 +667,7 @@ void mlx5_packet_reformat_dealloc(struct mlx5_core_dev *dev,
 
mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
+EXPORT_SYMBOL(mlx5_packet_reformat_dealloc);
 
 int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
 u8 namespace, u8 num_actions,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0f3d9942d1a9..4b1b505b20e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -170,14 +170,6 @@ struct mlx5_core_dev *mlx5_get_next_phys_dev(struct mlx5_core_dev *dev);
 void mlx5_dev_list_lock(void);
 void mlx5_dev_list_unlock(void);
 int mlx5_dev_list_trylock(void);
-int mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
-  int reformat_type,
-  size_t size,
-  void *reformat_data,
-  int namespace,
-  u32 *packet_reformat_id);
-void mlx5_packet_reformat_dealloc(struct mlx5_core_dev *dev,
- u32 packet_reformat_id);
 
 bool mlx5_lag_intf_add(struct mlx5_interface *intf, struct mlx5_priv *priv);
 
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 3bd5ad3fa28a..1068bc8ce7fa 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -203,4 +203,13 @@ int mlx5_modify_header_alloc(struct mlx5_core_dev *dev,
 void mlx5_modify_header_dealloc(struct mlx5_core_dev *dev,
u32 modify_header_id);
 
+int mlx5_packet_reformat_alloc(struct mlx5_core_dev *dev,
+  int reformat_type,
+  size_t size,
+  void *reformat_data,
+  int namespace,
+  u32 *packet_reformat_id);
+void mlx5_packet_reformat_dealloc(struct mlx5_core_dev *dev,
+ u32 packet_reformat_id);
+
 #endif
-- 
2.14.4



[PATCH rdma-next 11/27] RDMA/mlx5: Add NIC TX steering support

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Add support for proper NIC TX (egress) steering with multiple priorities.
We expose the same number of priorities as the bypass (NIC RX) steering.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/main.c| 28 ++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  1 +
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 9011187ca081..b3281e408d2a 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2883,7 +2883,7 @@ is_valid_esp_aes_gcm(struct mlx5_core_dev *mdev,
 * rules would be supported, always return VALID_SPEC_NA.
 */
if (!is_crypto)
-   return egress ? VALID_SPEC_INVALID : VALID_SPEC_NA;
+   return VALID_SPEC_NA;
 
return is_crypto && is_ipsec &&
(!egress || (!is_drop && !flow_act->has_flow_tag)) ?
@@ -3058,21 +3058,26 @@ static struct mlx5_ib_flow_prio *get_flow_table(struct mlx5_ib_dev *dev,
max_table_size = BIT(MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev,
   log_max_ft_size));
if (flow_attr->type == IB_FLOW_ATTR_NORMAL) {
-   if (ft_type == MLX5_IB_FT_TX)
-   priority = 0;
-   else if (flow_is_multicast_only(flow_attr) &&
-!dont_trap)
+   enum mlx5_flow_namespace_type fn_type;
+
+   if (flow_is_multicast_only(flow_attr) &&
+   !dont_trap)
priority = MLX5_IB_FLOW_MCAST_PRIO;
else
priority = ib_prio_to_core_prio(flow_attr->priority,
dont_trap);
-   ns = mlx5_get_flow_namespace(dev->mdev,
-ft_type == MLX5_IB_FT_TX ?
-MLX5_FLOW_NAMESPACE_EGRESS :
-MLX5_FLOW_NAMESPACE_BYPASS);
+   if (ft_type == MLX5_IB_FT_RX) {
+   fn_type = MLX5_FLOW_NAMESPACE_BYPASS;
+   prio = &dev->flow_db->prios[priority];
+   } else {
+			max_table_size = BIT(MLX5_CAP_FLOWTABLE_NIC_TX(dev->mdev,
+							log_max_ft_size));
+   fn_type = MLX5_FLOW_NAMESPACE_EGRESS;
+   prio = &dev->flow_db->egress_prios[priority];
+   }
+   ns = mlx5_get_flow_namespace(dev->mdev, fn_type);
num_entries = MLX5_FS_MAX_ENTRIES;
num_groups = MLX5_FS_MAX_TYPES;
-   prio = &dev->flow_db->prios[priority];
} else if (flow_attr->type == IB_FLOW_ATTR_ALL_DEFAULT ||
   flow_attr->type == IB_FLOW_ATTR_MC_DEFAULT) {
ns = mlx5_get_flow_namespace(dev->mdev,
@@ -3271,6 +3276,9 @@ static struct mlx5_ib_flow_handler *_create_flow_rule(struct mlx5_ib_dev *dev,
if (!is_valid_attr(dev->mdev, flow_attr))
return ERR_PTR(-EINVAL);
 
+   if (dev->rep && is_egress)
+   return ERR_PTR(-EINVAL);
+
spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
handler = kzalloc(sizeof(*handler), GFP_KERNEL);
if (!handler || !spec) {
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 462505c8fa25..01bef4c2d396 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -188,6 +188,7 @@ struct mlx5_ib_flow_matcher {
 
 struct mlx5_ib_flow_db {
 	struct mlx5_ib_flow_prio	prios[MLX5_IB_NUM_FLOW_FT];
+	struct mlx5_ib_flow_prio	egress_prios[MLX5_IB_NUM_FLOW_FT];
 	struct mlx5_ib_flow_prio	sniffer[MLX5_IB_NUM_SNIFFER_FTS];
 	struct mlx5_ib_flow_prio	egress[MLX5_IB_NUM_EGRESS_FTS];
struct mlx5_flow_table  *lag_demux_ft;
-- 
2.14.4



[PATCH rdma-next 13/27] RDMA/mlx5: Add a new flow action verb, modify header

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Expose the ability to create a flow action which mutates packet
headers. The data passed from userspace should be modify header actions
as defined by Mellanox's PRM.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/flow.c | 128 ++
 drivers/infiniband/hw/mlx5/main.c |   5 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  15 
 include/uapi/rdma/mlx5_user_ioctl_cmds.h  |  10 +++
 include/uapi/rdma/mlx5_user_ioctl_verbs.h |   5 ++
 5 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index ee398a9b5f26..2c7d75cb8ade 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -177,6 +178,114 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_MATCHER_CREATE)(
return err;
 }
 
+static int mlx5_ib_ft_type_to_namespace(u8 table_type, u8 *namespace)
+{
+   switch (table_type) {
+   case MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_RX:
+   *namespace = MLX5_FLOW_NAMESPACE_BYPASS;
+   break;
+   case MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_TX:
+   *namespace = MLX5_FLOW_NAMESPACE_EGRESS;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+void mlx5_ib_destroy_flow_action_raw(struct mlx5_ib_flow_action *maction)
+{
+   switch (maction->flow_action_raw.sub_type) {
+   case MLX5_IB_FLOW_ACTION_MODIFY_HEADER:
+   mlx5_modify_header_dealloc(maction->flow_action_raw.dev->mdev,
+  maction->flow_action_raw.action_id);
+   break;
+   default:
+   WARN_ON(true);
+   break;
+   }
+}
+
+static struct ib_flow_action *
+mlx5_ib_create_modify_header(struct mlx5_ib_dev *dev, u8 ft_type,
+u8 num_actions, void *in)
+{
+   struct mlx5_ib_flow_action *maction;
+   u8 namespace;
+   int ret;
+
+   ret = mlx5_ib_ft_type_to_namespace(ft_type, &namespace);
+   if (ret)
+   return ERR_PTR(-EINVAL);
+
+   maction = kzalloc(sizeof(*maction), GFP_KERNEL);
+   if (!maction)
+   return ERR_PTR(-ENOMEM);
+
+   ret = mlx5_modify_header_alloc(dev->mdev, namespace, num_actions, in,
+  &maction->flow_action_raw.action_id);
+
+   if (ret) {
+   kfree(maction);
+   return ERR_PTR(ret);
+   }
+   maction->flow_action_raw.sub_type =
+   MLX5_IB_FLOW_ACTION_MODIFY_HEADER;
+   maction->flow_action_raw.dev = dev;
+
+   return &maction->ib_action;
+}
+
+static bool mlx5_ib_modify_header_supported(struct mlx5_ib_dev *dev)
+{
+	return MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev, max_modify_header_actions) ||
+	       MLX5_CAP_FLOWTABLE_NIC_TX(dev->mdev, max_modify_header_actions);
+}
+
+static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_ACTION_CREATE_MODIFY_HEADER)(struct ib_device *ib_dev,
+									    struct ib_uverbs_file *file,
+									    struct uverbs_attr_bundle *attrs)
+{
+	struct ib_uobject *uobj = uverbs_attr_get_uobject(attrs,
+			MLX5_IB_ATTR_CREATE_MODIFY_HEADER_HANDLE);
+   struct mlx5_ib_dev *mdev = to_mdev(uobj->context->device);
+   enum mlx5_ib_uapi_flow_table_type ft_type;
+   struct ib_flow_action *action;
+   void *in;
+   int len;
+   int ret;
+
+   if (!mlx5_ib_modify_header_supported(mdev))
+   return -EOPNOTSUPP;
+
+	in = uverbs_attr_get_alloced_ptr(attrs,
+					 MLX5_IB_ATTR_CREATE_MODIFY_HEADER_ACTIONS_PRM);
+	len = uverbs_attr_get_len(attrs, MLX5_IB_ATTR_CREATE_MODIFY_HEADER_ACTIONS_PRM);
+
+   if (len % MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto))
+   return -EINVAL;
+
+   ret = uverbs_get_const(&ft_type, attrs,
+  MLX5_IB_ATTR_CREATE_MODIFY_HEADER_FT_TYPE);
+   if (ret)
+   return -EINVAL;
+
+   action = mlx5_ib_create_modify_header(mdev, ft_type,
+					      len / MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto),
+ in);
+   if (IS_ERR(action))
+   return PTR_ERR(action);
+
+   atomic_set(&action->usecnt, 0);
+   action->device = uobj->context->device;
+   action->type = IB_FLOW_ACTION_UNSPECIFIED;
+   action->uobject = uobj;
+   uobj->object = action;
+
+   return 0;
+}
+
 DECLARE_UVERBS_NAMED_METHOD(
MLX5_IB_METHOD_CREATE_FLOW,
UVERBS_ATTR_IDR(MLX5_IB_ATTR_CREATE_FLOW_HANDLE,
@@ -211,6 +320,24 @

[PATCH rdma-next 12/27] RDMA/uverbs: Add UVERBS_ATTR_CONST_IN to the specs language

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

This makes it clear and safe to access constants passed in from user space.
We define a consistent ABI of u64 for all constants, and verify that
the data passed in can be represented by the type the user supplies.

The expectation is this will always be used with an enum declaring the
constant values, and the user will use the enum type as input to the
accessor.

Signed-off-by: Jason Gunthorpe 
Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/uverbs_ioctl.c | 21 +
 include/rdma/uverbs_ioctl.h| 30 ++
 2 files changed, 51 insertions(+)

diff --git a/drivers/infiniband/core/uverbs_ioctl.c b/drivers/infiniband/core/uverbs_ioctl.c
index 23a1777f26e2..3ad3b69e32ab 100644
--- a/drivers/infiniband/core/uverbs_ioctl.c
+++ b/drivers/infiniband/core/uverbs_ioctl.c
@@ -537,3 +537,24 @@ int uverbs_get_flags32(u32 *to, const struct uverbs_attr_bundle *attrs_bundle,
return 0;
 }
 EXPORT_SYMBOL(uverbs_get_flags32);
+
+int _uverbs_get_const(s64 *to, const struct uverbs_attr_bundle *attrs_bundle,
+ size_t idx, s64 lower_bound, u64 upper_bound)
+
+{
+   const struct uverbs_attr *attr;
+
+   attr = uverbs_attr_get(attrs_bundle, idx);
+   if (IS_ERR(attr))
+   return PTR_ERR(attr);
+
+   WARN_ON(attr->ptr_attr.len != 8);
+
+   *to = attr->ptr_attr.data;
+
+   if (*to < lower_bound || *to > upper_bound)
+   return -EINVAL;
+
+   return 0;
+}
+EXPORT_SYMBOL(_uverbs_get_const);
diff --git a/include/rdma/uverbs_ioctl.h b/include/rdma/uverbs_ioctl.h
index 19b421d2d82a..f703c8ebbb02 100644
--- a/include/rdma/uverbs_ioctl.h
+++ b/include/rdma/uverbs_ioctl.h
@@ -268,6 +268,15 @@ struct uverbs_object_tree_def {
  __VA_ARGS__ },   \
})
 
+/* An input value that is a member in the enum _enum_type. */
+#define UVERBS_ATTR_CONST_IN(_attr_id, _enum_type, ...)		       \
+   UVERBS_ATTR_PTR_IN(\
+   _attr_id,  \
+   UVERBS_ATTR_SIZE(sizeof(u64) + BUILD_BUG_ON_ZERO(  \
+   !sizeof(_enum_type)),  \
+sizeof(u64)), \
+   __VA_ARGS__)
+
 /*
  * An input value that is a bitwise combination of values of _enum_type.
  * This permits the flag value to be passed as either a u32 or u64, it must
@@ -536,6 +545,27 @@ int uverbs_get_flags64(u64 *to, const struct uverbs_attr_bundle *attrs_bundle,
   size_t idx, u64 allowed_bits);
 int uverbs_get_flags32(u32 *to, const struct uverbs_attr_bundle *attrs_bundle,
   size_t idx, u64 allowed_bits);
+#if IS_ENABLED(CONFIG_INFINIBAND_USER_ACCESS)
+int _uverbs_get_const(s64 *to, const struct uverbs_attr_bundle *attrs_bundle,
+ size_t idx, s64 lower_bound, u64 upper_bound);
+#else
+static inline int
+_uverbs_get_const(s64 *to, const struct uverbs_attr_bundle *attrs_bundle,
+ size_t idx, s64 lower_bound, u64 upper_bound)
+{
+   return -EINVAL;
+}
+#endif
+
+#define uverbs_get_const(_to, _attrs_bundle, _idx)		       \
+   ({ \
+   s64 _val;  \
+   int _ret = _uverbs_get_const(&_val, _attrs_bundle, _idx,   \
+type_min(typeof(*_to)),   \
+type_max(typeof(*_to)));  \
+   (*_to) = _val; \
+   _ret;  \
+   })
 
 /* =
  *  Definitions -> Specs infrastructure
-- 
2.14.4



[PATCH rdma-next 14/27] RDMA/mlx5: Enable attaching modify header to steering flows

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

When creating a flow steering rule, allow the user to attach a modify
header action.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/main.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b56ac6614be6..473c8e5d21a5 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2466,6 +2466,14 @@ static int parse_flow_flow_action(const union ib_flow_spec *ib_spec,
MLX5_FLOW_CONTEXT_ACTION_ENCRYPT :
MLX5_FLOW_CONTEXT_ACTION_DECRYPT;
return 0;
+   case IB_FLOW_ACTION_UNSPECIFIED:
+   if (maction->flow_action_raw.sub_type ==
+   MLX5_IB_FLOW_ACTION_MODIFY_HEADER) {
+   action->action |= MLX5_FLOW_CONTEXT_ACTION_MOD_HDR;
+   action->modify_id = maction->flow_action_raw.action_id;
+   return 0;
+   }
+   /* fall through */
default:
return -EOPNOTSUPP;
}
-- 
2.14.4



[PATCH rdma-next 15/27] RDMA/mlx5: Enable decap and packet reformat on flow tables

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

If NIC RX flow tables support decap operation, enable it on creation.
If NIC TX flow tables support reformat operation, enable it on creation.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/main.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 473c8e5d21a5..d826d7b21c2e 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3034,14 +3034,15 @@ enum flow_table_type {
 static struct mlx5_ib_flow_prio *_get_prio(struct mlx5_flow_namespace *ns,
   struct mlx5_ib_flow_prio *prio,
   int priority,
-  int num_entries, int num_groups)
+  int num_entries, int num_groups,
+  u32 flags)
 {
struct mlx5_flow_table *ft;
 
ft = mlx5_create_auto_grouped_flow_table(ns, priority,
 num_entries,
 num_groups,
-0, 0);
+0, flags);
if (IS_ERR(ft))
return ERR_CAST(ft);
 
@@ -3061,6 +3062,7 @@ static struct mlx5_ib_flow_prio *get_flow_table(struct mlx5_ib_dev *dev,
int max_table_size;
int num_entries;
int num_groups;
+   u32 flags = 0;
int priority;
 
max_table_size = BIT(MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev,
@@ -3077,11 +3079,17 @@ static struct mlx5_ib_flow_prio *get_flow_table(struct mlx5_ib_dev *dev,
if (ft_type == MLX5_IB_FT_RX) {
fn_type = MLX5_FLOW_NAMESPACE_BYPASS;
prio = &dev->flow_db->prios[priority];
+   if (!dev->rep &&
+   MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev, decap))
+   flags |= MLX5_FLOW_TABLE_TUNNEL_EN_DECAP;
} else {
max_table_size = BIT(MLX5_CAP_FLOWTABLE_NIC_TX(dev->mdev,
       log_max_ft_size));
fn_type = MLX5_FLOW_NAMESPACE_EGRESS;
prio = &dev->flow_db->egress_prios[priority];
+   if (!dev->rep && MLX5_CAP_FLOWTABLE_NIC_TX(dev->mdev,
+  reformat))
+   flags |= MLX5_FLOW_TABLE_TUNNEL_EN_REFORMAT;
}
ns = mlx5_get_flow_namespace(dev->mdev, fn_type);
num_entries = MLX5_FS_MAX_ENTRIES;
@@ -3117,7 +3125,8 @@ static struct mlx5_ib_flow_prio *get_flow_table(struct mlx5_ib_dev *dev,
 
ft = prio->flow_table;
if (!ft)
-   return _get_prio(ns, prio, priority, num_entries, num_groups);
+   return _get_prio(ns, prio, priority, num_entries, num_groups,
+flags);
 
return prio;
 }
@@ -3695,7 +3704,7 @@ static struct mlx5_ib_flow_prio *_get_flow_table(struct mlx5_ib_dev *dev,
return prio;
 
return _get_prio(ns, prio, priority, MLX5_FS_MAX_ENTRIES,
-MLX5_FS_MAX_TYPES);
+MLX5_FS_MAX_TYPES, 0);
 }
 
 static struct mlx5_ib_flow_handler *
-- 
2.14.4



[PATCH rdma-next 20/27] RDMA/mlx5: Enable reformat on NIC RX if supported

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

An L3_TUNNEL_TO_L2 decap flow action requires the encap bit to be
enabled on the flow table, so enable it if supported. This will allow
attaching those flow actions to NIC RX steering. We don't enable it
when running on a representor.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/main.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 9c42a1059590..b1c7cf26e206 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3087,6 +3087,10 @@ static struct mlx5_ib_flow_prio *get_flow_table(struct mlx5_ib_dev *dev,
if (!dev->rep &&
MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev, decap))
flags |= MLX5_FLOW_TABLE_TUNNEL_EN_DECAP;
+   if (!dev->rep &&
+   MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev,
+ reformat_l3_tunnel_to_l2))
+   flags |= MLX5_FLOW_TABLE_TUNNEL_EN_REFORMAT;
} else {
max_table_size = BIT(MLX5_CAP_FLOWTABLE_NIC_TX(dev->mdev,
       log_max_ft_size));
-- 
2.14.4



[PATCH rdma-next 19/27] RDMA/mlx5: Extend packet reformat verbs

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

We expose new actions:

L2_TO_L2_TUNNEL - A generic encap from L2 to L2; the data passed should
be the encapsulating headers.

L3_TUNNEL_TO_L2 - Performs decap where the inner packet starts at L3;
the data should be MAC or MAC + VLAN (14 or 18 bytes).

L2_TO_L3_TUNNEL - Performs encap where the L2 of the original packet is
not included; the data should be the encapsulating header.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/flow.c | 95 +++
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  1 +
 include/uapi/rdma/mlx5_user_ioctl_cmds.h  |  1 +
 include/uapi/rdma/mlx5_user_ioctl_verbs.h |  3 +
 4 files changed, 100 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index 0fea98cb7d42..a6b4f37a5359 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -202,6 +202,10 @@ void mlx5_ib_destroy_flow_action_raw(struct mlx5_ib_flow_action *maction)
mlx5_modify_header_dealloc(maction->flow_action_raw.dev->mdev,
   maction->flow_action_raw.action_id);
break;
+   case MLX5_IB_FLOW_ACTION_PACKET_REFORMAT:
+   mlx5_packet_reformat_dealloc(maction->flow_action_raw.dev->mdev,
+    maction->flow_action_raw.action_id);
+   break;
case MLX5_IB_FLOW_ACTION_DECAP:
break;
default:
@@ -291,6 +295,21 @@ static bool mlx5_ib_flow_action_packet_reformat_valid(struct mlx5_ib_dev *ibdev,
  u8 ft_type)
 {
switch (packet_reformat_type) {
+   case MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TO_L2_TUNNEL:
+   if (ft_type == MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_TX)
+   return MLX5_CAP_FLOWTABLE(ibdev->mdev,
+ encap_general_header);
+   break;
+   case MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TO_L3_TUNNEL:
+   if (ft_type == MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_TX)
+   return MLX5_CAP_FLOWTABLE_NIC_TX(ibdev->mdev,
+    reformat_l2_to_l3_tunnel);
+   break;
+   case MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L3_TUNNEL_TO_L2:
+   if (ft_type == MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_RX)
+   return MLX5_CAP_FLOWTABLE_NIC_RX(ibdev->mdev,
+    reformat_l3_tunnel_to_l2);
+   break;
case MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TUNNEL_TO_L2:
if (ft_type == MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_RX)
return MLX5_CAP_FLOWTABLE_NIC_RX(ibdev->mdev, decap);
@@ -302,6 +321,55 @@ static bool mlx5_ib_flow_action_packet_reformat_valid(struct mlx5_ib_dev *ibdev,
return false;
 }
 
+static int mlx5_ib_dv_to_prm_packet_reforamt_type(u8 dv_prt, u8 *prm_prt)
+{
+   switch (dv_prt) {
+   case MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TO_L2_TUNNEL:
+   *prm_prt = MLX5_REFORMAT_TYPE_L2_TO_L2_TUNNEL;
+   break;
+   case MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L3_TUNNEL_TO_L2:
+   *prm_prt = MLX5_REFORMAT_TYPE_L3_TUNNEL_TO_L2;
+   break;
+   case MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TO_L3_TUNNEL:
+   *prm_prt = MLX5_REFORMAT_TYPE_L2_TO_L3_TUNNEL;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int mlx5_ib_flow_action_create_packet_reformat_ctx(struct mlx5_ib_dev *dev,
+                                                          struct mlx5_ib_flow_action *maction,
+                                                          u8 ft_type, u8 dv_prt,
+                                                          void *in, size_t len)
+{
+   u8 namespace;
+   u8 prm_prt;
+   int ret;
+
+   ret = mlx5_ib_ft_type_to_namespace(ft_type, &namespace);
+   if (ret)
+   return ret;
+
+   ret = mlx5_ib_dv_to_prm_packet_reforamt_type(dv_prt, &prm_prt);
+   if (ret)
+   return ret;
+
+   ret = mlx5_packet_reformat_alloc(dev->mdev, prm_prt, len,
+in, namespace,
+&maction->flow_action_raw.action_id);
+   if (ret)
+   return ret;
+
+   maction->flow_action_raw.sub_type =
+   MLX5_IB_FLOW_ACTION_PACKET_REFORMAT;
+   maction->flow_action_raw.dev = dev;
+
+   return 0;
+}
+
 static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_ACTION_CREATE_PACKET_REFORMAT)(struct ib_device *ib_dev,
                                                                              struct ib_uverbs_file *file,
 

[PATCH rdma-next 18/27] RDMA/mlx5: Enable attaching DECAP action to steering flows

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Any matching flow will be stripped of its VXLAN tunnel; only the inner
L2 onward is left.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/main.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index d826d7b21c2e..9c42a1059590 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2473,6 +2473,11 @@ static int parse_flow_flow_action(const union ib_flow_spec *ib_spec,
action->modify_id = maction->flow_action_raw.action_id;
return 0;
}
+   if (maction->flow_action_raw.sub_type ==
+   MLX5_IB_FLOW_ACTION_DECAP) {
+   action->action |= MLX5_FLOW_CONTEXT_ACTION_DECAP;
+   return 0;
+   }
/* fall through */
default:
return -EOPNOTSUPP;
-- 
2.14.4



[PATCH rdma-next 21/27] RDMA/mlx5: Enable attaching packet reformat action to steering flows

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Any matching rule will be mutated based on the packet reformat context
which is attached to the given flow rule.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/main.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index b1c7cf26e206..0af0bdc5804b 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2478,6 +2478,13 @@ static int parse_flow_flow_action(const union ib_flow_spec *ib_spec,
action->action |= MLX5_FLOW_CONTEXT_ACTION_DECAP;
return 0;
}
+   if (maction->flow_action_raw.sub_type ==
+   MLX5_IB_FLOW_ACTION_PACKET_REFORMAT) {
+   action->action |=
+   MLX5_FLOW_CONTEXT_ACTION_PACKET_REFORMAT;
+   action->reformat_id = maction->flow_action_raw.action_id;
+   return 0;
+   }
/* fall through */
default:
return -EOPNOTSUPP;
-- 
2.14.4



[PATCH rdma-next 16/27] RDMA/uverbs: Add generic function to fill in flow action object

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Refactor the initialization of a flow action object to a common function.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/uverbs_std_types_flow_action.c |  7 ++-
 drivers/infiniband/hw/mlx5/flow.c  |  8 +++-
 include/rdma/uverbs_std_types.h| 12 
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_std_types_flow_action.c b/drivers/infiniband/core/uverbs_std_types_flow_action.c
index adb9209c4710..8beacfdb9f27 100644
--- a/drivers/infiniband/core/uverbs_std_types_flow_action.c
+++ b/drivers/infiniband/core/uverbs_std_types_flow_action.c
@@ -327,11 +327,8 @@ static int UVERBS_HANDLER(UVERBS_METHOD_FLOW_ACTION_ESP_CREATE)(struct ib_device
if (IS_ERR(action))
return PTR_ERR(action);
 
-   atomic_set(&action->usecnt, 0);
-   action->device = ib_dev;
-   action->type = IB_FLOW_ACTION_ESP;
-   action->uobject = uobj;
-   uobj->object = action;
+   uverbs_flow_action_fill_action(action, uobj, ib_dev,
+  IB_FLOW_ACTION_ESP);
 
return 0;
 }
diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index 2c7d75cb8ade..d0325e468801 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -277,11 +278,8 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_ACTION_CREATE_MODIFY_HEADER)(struc
if (IS_ERR(action))
return PTR_ERR(action);
 
-   atomic_set(&action->usecnt, 0);
-   action->device = uobj->context->device;
-   action->type = IB_FLOW_ACTION_UNSPECIFIED;
-   action->uobject = uobj;
-   uobj->object = action;
+   uverbs_flow_action_fill_action(action, uobj, uobj->context->device,
+  IB_FLOW_ACTION_UNSPECIFIED);
 
return 0;
 }
diff --git a/include/rdma/uverbs_std_types.h b/include/rdma/uverbs_std_types.h
index 076f085d2dcf..3686da497cf6 100644
--- a/include/rdma/uverbs_std_types.h
+++ b/include/rdma/uverbs_std_types.h
@@ -125,5 +125,17 @@ static inline struct ib_uobject *__uobj_alloc(const struct uverbs_obj_type *type
 
 #define uobj_alloc(_type, _ufile) __uobj_alloc(uobj_get_type(_type), _ufile)
 
+static inline void uverbs_flow_action_fill_action(struct ib_flow_action *action,
+ struct ib_uobject *uobj,
+ struct ib_device *ib_dev,
+ enum ib_flow_action_type type)
+{
+   atomic_set(&action->usecnt, 0);
+   action->device = ib_dev;
+   action->type = type;
+   action->uobject = uobj;
+   uobj->object = action;
+}
+
 #endif
 
-- 
2.14.4



[PATCH rdma-next 22/27] IB/uverbs: Add IDRs array attribute type to ioctl() interface

2018-07-29 Thread Leon Romanovsky
From: Guy Levi 

Methods sometimes need a flexible set of idrs rather than the strict
set provided today by the conventional idr attribute. This is an
idrs-array-like behavior.
Since this may be commonly used, we add a new IDRS_ARRAY attribute to
the generic uverbs ioctl layer.

This attribute is embedded in methods, like any other attribute we
currently have. IDRS_ARRAY points to an array of idrs of the same object
type and the same access rights (only write and read are supported). It
is defined with a minimum and maximum length to be enforced and can be
defined as a mandatory attribute.

Signed-off-by: Guy Levi 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/uverbs_ioctl.c   | 94 +++-
 include/rdma/uverbs_ioctl.h  | 68 ++-
 include/uapi/rdma/rdma_user_ioctl_cmds.h |  2 +-
 3 files changed, 160 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_ioctl.c b/drivers/infiniband/core/uverbs_ioctl.c
index 3ad3b69e32ab..ce40f97dbbf2 100644
--- a/drivers/infiniband/core/uverbs_ioctl.c
+++ b/drivers/infiniband/core/uverbs_ioctl.c
@@ -46,6 +46,79 @@ static bool uverbs_is_attr_cleared(const struct ib_uverbs_attr *uattr,
   0, uattr->len - len);
 }
 
+static int uverbs_process_idrs_arr_attr(struct ib_uverbs_file *ufile,
+   struct uverbs_objs_arr_attr *attr,
+   const struct ib_uverbs_attr *uattr,
+   const struct uverbs_attr_spec *spec)
+{
+   const struct uverbs_object_spec *object;
+   int err;
+   int i = 0; /* Initialization for error flow */
+
+   if (!ufile->ucontext || uattr->attr_data.reserved)
+   return -EINVAL;
+
+   if (uattr->len % sizeof(u32))
+   return -EINVAL;
+
+   attr->len = uattr->len / sizeof(u32);
+
+   if (attr->len < spec->u2.objs_arr.min_len ||
+   attr->len > spec->u2.objs_arr.max_len)
+   return -EINVAL;
+
+   attr->uobjects = kvmalloc_array(attr->len, sizeof(*attr->uobjects),
+   GFP_KERNEL);
+   if (!attr->uobjects)
+   return -ENOMEM;
+
+   /* Since idr is 4B and *uobjects is >= 4B, we can use
+* attr->uobjects to store idrs array and avoid additional memory
+* allocation. The idrs array is offset to the end of the uobjects
+* array so we will be able to read a 4B idr and replace with a
+* 8B pointer.
+*/
+   if (uattr->len > sizeof(uattr->data)) {
+   err = copy_from_user((u8 *)attr->uobjects + uattr->len,
+u64_to_user_ptr(uattr->data),
+uattr->len);
+   if (err) {
+   err = -EFAULT;
+   goto err_objs_arr;
+   }
+   } else {
+   memcpy((u8 *)attr->uobjects + uattr->len, &uattr->data,
+  uattr->len);
+   }
+
+   object = uverbs_get_object(ufile, spec->u2.objs_arr.obj_type);
+   if (!object) {
+   err = -EINVAL;
+   goto err_objs_arr;
+   }
+
+   for (i = 0; i < attr->len; i++) {
+   attr->uobjects[i] =
+   uverbs_get_uobject_from_file(object->type_attrs, ufile,
+spec->u2.objs_arr.access,
+((u32 *)attr->uobjects)[attr->len + i]);
+   if (IS_ERR(attr->uobjects[i])) {
+   err = PTR_ERR(attr->uobjects[i]);
+   goto err_objs_arr;
+   }
+   }
+
+   return 0;
+
+err_objs_arr:
+   while (i > 0)
+   uverbs_finalize_object(attr->uobjects[--i],
+  spec->u2.objs_arr.access, false);
+
+   kvfree(attr->uobjects);
+   return err;
+}
+
 static int uverbs_process_attr(struct ib_uverbs_file *ufile,
   const struct ib_uverbs_attr *uattr,
   u16 attr_id,
@@ -59,6 +132,7 @@ static int uverbs_process_attr(struct ib_uverbs_file *ufile,
const struct uverbs_object_spec *object;
struct uverbs_obj_attr *o_attr;
struct uverbs_attr *elements = attr_bundle_h->attrs;
+   int err;
 
if (attr_id >= attr_spec_bucket->num_attrs) {
if (uattr->flags & UVERBS_ATTR_F_MANDATORY)
@@ -176,6 +250,14 @@ static int uverbs_process_attr(struct ib_uverbs_file *ufile,
}
 
break;
+
+   case UVERBS_ATTR_TYPE_IDRS_ARRAY:
+   err = uverbs_process_idrs_arr_attr(ufile, &e->objs_arr_attr,
+  uattr, spec);
+   if (err)
+   return err;
+
+   break;
default:
return -EOPNOTSUPP;
}

[PATCH rdma-next 24/27] RDMA/mlx5: Refactor DEVX flow creation

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Move struct mlx5_flow_act so it is passed from the METHOD entry point;
this will allow adding flow action support to the DEVX path.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/flow.c|  4 +++-
 drivers/infiniband/hw/mlx5/main.c| 12 +++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  4 +++-
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index a6b4f37a5359..072b8fc7e057 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -44,6 +44,7 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
struct ib_device *ib_dev, struct ib_uverbs_file *file,
struct uverbs_attr_bundle *attrs)
 {
+   struct mlx5_flow_act flow_act = {.flow_tag = MLX5_FS_DEFAULT_FLOW_TAG};
struct mlx5_ib_flow_handler *flow_handler;
struct mlx5_ib_flow_matcher *fs_matcher;
void *devx_obj;
@@ -106,7 +107,8 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
MLX5_IB_ATTR_CREATE_FLOW_MATCH_VALUE);
fs_matcher = uverbs_attr_get_obj(attrs,
 MLX5_IB_ATTR_CREATE_FLOW_MATCHER);
-   flow_handler = mlx5_ib_raw_fs_rule_add(dev, fs_matcher, cmd_in, inlen,
+   flow_handler = mlx5_ib_raw_fs_rule_add(dev, fs_matcher, &flow_act,
+  cmd_in, inlen,
   dest_id, dest_type);
if (IS_ERR(flow_handler))
return PTR_ERR(flow_handler);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 81780beeb83c..2b2af82dc32e 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3729,10 +3729,10 @@ _create_raw_flow_rule(struct mlx5_ib_dev *dev,
  struct mlx5_ib_flow_prio *ft_prio,
  struct mlx5_flow_destination *dst,
  struct mlx5_ib_flow_matcher  *fs_matcher,
+ struct mlx5_flow_act *flow_act,
  void *cmd_in, int inlen)
 {
struct mlx5_ib_flow_handler *handler;
-   struct mlx5_flow_act flow_act = {.flow_tag = MLX5_FS_DEFAULT_FLOW_TAG};
struct mlx5_flow_spec *spec;
struct mlx5_flow_table *ft = ft_prio->flow_table;
int err = 0;
@@ -3751,9 +3751,8 @@ _create_raw_flow_rule(struct mlx5_ib_dev *dev,
   fs_matcher->mask_len);
spec->match_criteria_enable = fs_matcher->match_criteria_enable;
 
-   flow_act.action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
handler->rule = mlx5_add_flow_rules(ft, spec,
-   &flow_act, dst, 1);
+   flow_act, dst, 1);
 
if (IS_ERR(handler->rule)) {
err = PTR_ERR(handler->rule);
@@ -3815,6 +3814,7 @@ static bool raw_fs_is_multicast(struct mlx5_ib_flow_matcher *fs_matcher,
 struct mlx5_ib_flow_handler *
 mlx5_ib_raw_fs_rule_add(struct mlx5_ib_dev *dev,
struct mlx5_ib_flow_matcher *fs_matcher,
+   struct mlx5_flow_act *flow_act,
void *cmd_in, int inlen, int dest_id,
int dest_type)
 {
@@ -3847,13 +3847,15 @@ mlx5_ib_raw_fs_rule_add(struct mlx5_ib_dev *dev,
if (dest_type == MLX5_FLOW_DESTINATION_TYPE_TIR) {
dst->type = dest_type;
dst->tir_num = dest_id;
+   flow_act->action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
} else {
dst->type = MLX5_FLOW_DESTINATION_TYPE_FLOW_TABLE_NUM;
dst->ft_num = dest_id;
+   flow_act->action |= MLX5_FLOW_CONTEXT_ACTION_FWD_DEST;
}
 
-   handler = _create_raw_flow_rule(dev, ft_prio, dst, fs_matcher, cmd_in,
-   inlen);
+   handler = _create_raw_flow_rule(dev, ft_prio, dst, fs_matcher, flow_act,
+   cmd_in, inlen);
 
if (IS_ERR(handler)) {
err = PTR_ERR(handler);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 97fa894deafc..76f1c178cef7 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1249,7 +1250,8 @@ void mlx5_ib_devx_destroy(struct mlx5_ib_dev *dev,
 const struct uverbs_object_tree_def *mlx5_ib_get_devx_tree(void);
 struct mlx5_ib_flow_handler *mlx5_ib_raw_fs_rule_add(
struct mlx5_ib_dev *dev, struct mlx5_ib_flow_matcher *fs_matcher,
-   void *cmd_in, int inlen, int dest_id, int dest_type);
+   struct mlx5_flow_act *flow_act, void *cmd_in, int inlen,
+   int dest_id, int dest_type);
 bool mlx5_ib_devx_is_flow_dest(void *obj, int *dest_id, int *dest_

[PATCH mlx5-next 08/27] net/mlx5: Expose new packet reformat capabilities

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Expose new capabilities to use when creating a packet reformat context.

The new types which can be created are:
MLX5_REFORMAT_TYPE_L2_TO_L2_TUNNEL: ability to create a generic encap
operation to be done by the HW.

MLX5_REFORMAT_TYPE_L3_TUNNEL_TO_L2: ability to create a generic decap
operation where the inner packet doesn't contain L2.

MLX5_REFORMAT_TYPE_L2_TO_L3_TUNNEL: ability to create a generic encap
operation to be done by the HW. The L2 of the original packet is
dropped.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 include/linux/mlx5/mlx5_ifc.h | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index f83435d749b6..9f26e53677ca 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -344,8 +344,12 @@ struct mlx5_ifc_flow_table_prop_layout_bits {
u8 reserved_at_c[0x1];
u8 pop_vlan_2[0x1];
u8 push_vlan_2[0x1];
-   u8 reserved_at_f[0x11];
-
+   u8 reformat_and_vlan_action[0x1];
+   u8 reserved_at_10[0x2];
+   u8 reformat_l3_tunnel_to_l2[0x1];
+   u8 reformat_l2_to_l3_tunnel[0x1];
+   u8 reformat_and_modify_action[0x1];
+   u8 reserved_at_14[0xb];
u8 reserved_at_20[0x2];
u8 log_max_ft_size[0x6];
u8 log_max_modify_header_context[0x8];
@@ -554,7 +558,13 @@ struct mlx5_ifc_flow_table_nic_cap_bits {
u8 nic_rx_multi_path_tirs[0x1];
u8 nic_rx_multi_path_tirs_fts[0x1];
u8 allow_sniffer_and_nic_rx_shared_tir[0x1];
-   u8 reserved_at_3[0x1fd];
+   u8 reserved_at_3[0x1d];
+   u8 encap_general_header[0x1];
+   u8 reserved_at_21[0xa];
+   u8 log_max_packet_reformat_context[0x5];
+   u8 reserved_at_30[0x6];
+   u8 max_encap_header_size[0xa];
+   u8 reserved_at_40[0x1c0];
 
	struct mlx5_ifc_flow_table_prop_layout_bits flow_table_properties_nic_receive;
 
@@ -4846,6 +4856,9 @@ struct mlx5_ifc_alloc_packet_reformat_context_out_bits {
 enum {
MLX5_REFORMAT_TYPE_L2_TO_VXLAN = 0x0,
MLX5_REFORMAT_TYPE_L2_TO_NVGRE = 0x1,
+   MLX5_REFORMAT_TYPE_L2_TO_L2_TUNNEL = 0x2,
+   MLX5_REFORMAT_TYPE_L3_TUNNEL_TO_L2 = 0x3,
+   MLX5_REFORMAT_TYPE_L2_TO_L3_TUNNEL = 0x4,
 };
 
 struct mlx5_ifc_alloc_packet_reformat_context_in_bits {
-- 
2.14.4



[PATCH rdma-next 25/27] RDMA/mlx5: Add flow actions support to DEVX create flow

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Support attaching flow actions to a flow rule via DEVX.
For now only the NIC RX path is supported.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/flow.c| 21 -
 include/uapi/rdma/mlx5_user_ioctl_cmds.h |  1 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index 072b8fc7e057..b254b55e8de0 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -47,6 +47,7 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
struct mlx5_flow_act flow_act = {.flow_tag = MLX5_FS_DEFAULT_FLOW_TAG};
struct mlx5_ib_flow_handler *flow_handler;
struct mlx5_ib_flow_matcher *fs_matcher;
+   struct ib_uobject **arr_flow_actions;
void *devx_obj;
int dest_id, dest_type;
void *cmd_in;
@@ -56,6 +57,9 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
struct ib_uobject *uobj =
uverbs_attr_get_uobject(attrs, MLX5_IB_ATTR_CREATE_FLOW_HANDLE);
struct mlx5_ib_dev *dev = to_mdev(uobj->context->device);
+   int len;
+   int ret;
+   int i;
 
if (!capable(CAP_NET_RAW))
return -EPERM;
@@ -107,6 +111,18 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
MLX5_IB_ATTR_CREATE_FLOW_MATCH_VALUE);
fs_matcher = uverbs_attr_get_obj(attrs,
 MLX5_IB_ATTR_CREATE_FLOW_MATCHER);
+
+   len = uverbs_attr_get_uobjs_arr(attrs,
+   MLX5_IB_ATTR_CREATE_FLOW_ARR_FLOW_ACTIONS,
+   &arr_flow_actions);
+   for (i = 0; i < len; i++) {
+   struct mlx5_ib_flow_action *maction = to_mflow_act(arr_flow_actions[i]->object);
+
+   ret = parse_flow_flow_action(maction, false, &flow_act);
+   if (ret)
+   return -EINVAL;
+   }
+
flow_handler = mlx5_ib_raw_fs_rule_add(dev, fs_matcher, &flow_act,
   cmd_in, inlen,
   dest_id, dest_type);
@@ -458,7 +474,10 @@ DECLARE_UVERBS_NAMED_METHOD(
UVERBS_ACCESS_READ),
UVERBS_ATTR_IDR(MLX5_IB_ATTR_CREATE_FLOW_DEST_DEVX,
MLX5_IB_OBJECT_DEVX_OBJ,
-   UVERBS_ACCESS_READ));
+   UVERBS_ACCESS_READ),
+   UVERBS_ATTR_IDRS_ARR(MLX5_IB_ATTR_CREATE_FLOW_ARR_FLOW_ACTIONS,
+UVERBS_OBJECT_FLOW_ACTION,
+UVERBS_ACCESS_READ, 1, 1));
 
 DECLARE_UVERBS_NAMED_METHOD_DESTROY(
MLX5_IB_METHOD_DESTROY_FLOW,
diff --git a/include/uapi/rdma/mlx5_user_ioctl_cmds.h b/include/uapi/rdma/mlx5_user_ioctl_cmds.h
index 75c7093fd95b..91c3d42ebd0f 100644
--- a/include/uapi/rdma/mlx5_user_ioctl_cmds.h
+++ b/include/uapi/rdma/mlx5_user_ioctl_cmds.h
@@ -155,6 +155,7 @@ enum mlx5_ib_create_flow_attrs {
MLX5_IB_ATTR_CREATE_FLOW_DEST_QP,
MLX5_IB_ATTR_CREATE_FLOW_DEST_DEVX,
MLX5_IB_ATTR_CREATE_FLOW_MATCHER,
+   MLX5_IB_ATTR_CREATE_FLOW_ARR_FLOW_ACTIONS,
 };
 
 enum mlx5_ib_destoy_flow_attrs {
-- 
2.14.4



[PATCH rdma-next 23/27] RDMA/mlx5: Refactor flow action parsing to be more generic

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Make the parsing of flow actions more generic so it can be used by
DEVX flow creation.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/main.c| 13 +++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  3 +++
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 0af0bdc5804b..81780beeb83c 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2452,17 +2452,16 @@ static int check_mpls_supp_fields(u32 field_support, const __be32 *set_mask)
   offsetof(typeof(filter), field) -\
   sizeof(filter.field))
 
-static int parse_flow_flow_action(const union ib_flow_spec *ib_spec,
- const struct ib_flow_attr *flow_attr,
- struct mlx5_flow_act *action)
+int parse_flow_flow_action(struct mlx5_ib_flow_action *maction,
+  bool is_egress,
+  struct mlx5_flow_act *action)
 {
-   struct mlx5_ib_flow_action *maction = to_mflow_act(ib_spec->action.act);
 
switch (maction->ib_action.type) {
case IB_FLOW_ACTION_ESP:
/* Currently only AES_GCM keymat is supported by the driver */
action->esp_id = (uintptr_t)maction->esp_aes_gcm.ctx;
-   action->action |= flow_attr->flags & IB_FLOW_ATTR_FLAGS_EGRESS ?
+   action->action |= is_egress ?
MLX5_FLOW_CONTEXT_ACTION_ENCRYPT :
MLX5_FLOW_CONTEXT_ACTION_DECRYPT;
return 0;
@@ -2822,7 +2821,9 @@ static int parse_flow_attr(struct mlx5_core_dev *mdev, u32 *match_c,
action->action |= MLX5_FLOW_CONTEXT_ACTION_DROP;
break;
case IB_FLOW_SPEC_ACTION_HANDLE:
-   ret = parse_flow_flow_action(ib_spec, flow_attr, action);
+   ret = parse_flow_flow_action(to_mflow_act(ib_spec->action.act),
+flow_attr->flags & IB_FLOW_ATTR_FLAGS_EGRESS,
+action);
if (ret)
return ret;
break;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index bb7a902a347f..97fa894deafc 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -870,6 +870,9 @@ to_mcounters(struct ib_counters *ibcntrs)
return container_of(ibcntrs, struct mlx5_ib_mcounters, ibcntrs);
 }
 
+int parse_flow_flow_action(struct mlx5_ib_flow_action *maction,
+  bool is_egress,
+  struct mlx5_flow_act *action);
 struct mlx5_ib_dev {
struct ib_deviceib_dev;
struct mlx5_core_dev*mdev;
-- 
2.14.4



[PATCH rdma-next 26/27] RDMA/mlx5: Add NIC TX namespace when getting a flow table

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Add the ability to get a NIC TX flow table when using _get_flow_table().
This will allow creating a matcher and a flow rule on the NIC TX path.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/flow.c|  1 +
 drivers/infiniband/hw/mlx5/main.c| 38 ++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  1 +
 3 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index b254b55e8de0..2422629f48c9 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -162,6 +162,7 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_MATCHER_CREATE)(
if (!obj)
return -ENOMEM;
 
+   obj->ns_type = MLX5_FLOW_NAMESPACE_BYPASS;
obj->mask_len = uverbs_attr_get_len(
attrs, MLX5_IB_ATTR_FLOW_MATCHER_MATCH_MASK);
err = uverbs_copy_from(&obj->matcher_mask,
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 2b2af82dc32e..ba4bcbc3adb6 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3695,33 +3695,52 @@ static struct ib_flow *mlx5_ib_create_flow(struct ib_qp *qp,
 }
 
 static struct mlx5_ib_flow_prio *_get_flow_table(struct mlx5_ib_dev *dev,
-int priority, bool mcast)
+struct mlx5_ib_flow_matcher *fs_matcher,
+bool mcast)
 {
-   int max_table_size;
struct mlx5_flow_namespace *ns = NULL;
struct mlx5_ib_flow_prio *prio;
+   int max_table_size = 0;
+   u32 flags = 0;
+   int priority;
+
+   if (fs_matcher->ns_type == MLX5_FLOW_NAMESPACE_BYPASS) {
+   max_table_size = BIT(MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev,
+      log_max_ft_size));
+   if (MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev, decap))
+   flags |= MLX5_FLOW_TABLE_TUNNEL_EN_DECAP;
+   if (MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev,
+ reformat_l3_tunnel_to_l2))
+   flags |= MLX5_FLOW_TABLE_TUNNEL_EN_REFORMAT;
+   } else { /* Can only be MLX5_FLOW_NAMESPACE_EGRESS */
+   max_table_size = BIT(MLX5_CAP_FLOWTABLE_NIC_TX(dev->mdev,
+      log_max_ft_size));
+   if (MLX5_CAP_FLOWTABLE_NIC_TX(dev->mdev, reformat))
+   flags |= MLX5_FLOW_TABLE_TUNNEL_EN_REFORMAT;
+   }
 
-   max_table_size = BIT(MLX5_CAP_FLOWTABLE_NIC_RX(dev->mdev,
-log_max_ft_size));
if (max_table_size < MLX5_FS_MAX_ENTRIES)
return ERR_PTR(-ENOMEM);
 
if (mcast)
priority = MLX5_IB_FLOW_MCAST_PRIO;
else
-   priority = ib_prio_to_core_prio(priority, false);
+   priority = ib_prio_to_core_prio(fs_matcher->priority, false);
 
-   ns = mlx5_get_flow_namespace(dev->mdev, MLX5_FLOW_NAMESPACE_BYPASS);
+   ns = mlx5_get_flow_namespace(dev->mdev, fs_matcher->ns_type);
if (!ns)
return ERR_PTR(-ENOTSUPP);
 
-   prio = &dev->flow_db->prios[priority];
+   if (fs_matcher->ns_type == MLX5_FLOW_NAMESPACE_BYPASS)
+   prio = &dev->flow_db->prios[priority];
+   else
+   prio = &dev->flow_db->egress_prios[priority];
 
if (prio->flow_table)
return prio;
 
return _get_prio(ns, prio, priority, MLX5_FS_MAX_ENTRIES,
-MLX5_FS_MAX_TYPES, 0);
+MLX5_FS_MAX_TYPES, flags);
 }
 
 static struct mlx5_ib_flow_handler *
@@ -3820,7 +3839,6 @@ mlx5_ib_raw_fs_rule_add(struct mlx5_ib_dev *dev,
 {
struct mlx5_flow_destination *dst;
struct mlx5_ib_flow_prio *ft_prio;
-   int priority = fs_matcher->priority;
struct mlx5_ib_flow_handler *handler;
bool mcast;
int err;
@@ -3838,7 +3856,7 @@ mlx5_ib_raw_fs_rule_add(struct mlx5_ib_dev *dev,
mcast = raw_fs_is_multicast(fs_matcher, cmd_in);
mutex_lock(&dev->flow_db->lock);
 
-   ft_prio = _get_flow_table(dev, priority, mcast);
+   ft_prio = _get_flow_table(dev, fs_matcher, mcast);
if (IS_ERR(ft_prio)) {
err = PTR_ERR(ft_prio);
goto unlock;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 76f1c178cef7..639b5dccf079 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -187,6 +187,7 @@ struct mlx5_ib_flow_matcher {
struct mlx5_ib_match_params matcher_mask;
int mask_len;
enum mlx5_ib_flow_type  flow_type;
+   u8  ns_type;
u1

[PATCH rdma-next 27/27] RDMA/mlx5: Allow creating a matcher for a NIC TX flow table

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

Currently a matcher can only be created and attached to a NIC RX flow
table. Extend it to allow it on NIC TX flow tables as well.

In order to achieve that, we:

1) Expose a new attribute: MLX5_IB_ATTR_FLOW_MATCHER_FLOW_FLAGS.
   enum ib_flow_flags is used as valid flags. Only
   IB_FLOW_ATTR_FLAGS_EGRESS is supported.

2) Remove the requirement to have a DEVX or QP destination when creating a
   flow. A flow added to a NIC TX flow table will forward the packet outside
   of the vport (wire or E-Switch in the SR-IOV case).

Only a single flow action can be attached to a flow rule at the moment.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/flow.c| 62 +---
 drivers/infiniband/hw/mlx5/main.c|  5 ++-
 include/uapi/rdma/mlx5_user_ioctl_cmds.h |  1 +
 3 files changed, 46 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index 2422629f48c9..26c5112100e6 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -18,6 +18,22 @@
 #define UVERBS_MODULE_NAME mlx5_ib
 #include 
 
+static int mlx5_ib_ft_type_to_namespace(u8 table_type, u8 *namespace)
+{
+   switch (table_type) {
+   case MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_RX:
+   *namespace = MLX5_FLOW_NAMESPACE_BYPASS;
+   break;
+   case MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_TX:
+   *namespace = MLX5_FLOW_NAMESPACE_EGRESS;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static const struct uverbs_attr_spec mlx5_ib_flow_type[] = {
[MLX5_IB_FLOW_TYPE_NORMAL] = {
.type = UVERBS_ATTR_TYPE_PTR_IN,
@@ -69,7 +85,14 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
dest_qp = uverbs_attr_is_valid(attrs,
   MLX5_IB_ATTR_CREATE_FLOW_DEST_QP);
 
-   if ((dest_devx && dest_qp) || (!dest_devx && !dest_qp))
+   fs_matcher = uverbs_attr_get_obj(attrs,
+MLX5_IB_ATTR_CREATE_FLOW_MATCHER);
+   if (fs_matcher->ns_type == MLX5_FLOW_NAMESPACE_BYPASS &&
+   ((dest_devx && dest_qp) || (!dest_devx && !dest_qp)))
+   return -EINVAL;
+
+   if (fs_matcher->ns_type == MLX5_FLOW_NAMESPACE_EGRESS &&
+   (dest_devx || dest_qp))
return -EINVAL;
 
if (dest_devx) {
@@ -83,7 +106,7 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
 */
if (!mlx5_ib_devx_is_flow_dest(devx_obj, &dest_id, &dest_type))
return -EINVAL;
-   } else {
+   } else if (dest_qp) {
struct mlx5_ib_qp *mqp;
 
qp = uverbs_attr_get_obj(attrs,
@@ -100,6 +123,8 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
else
dest_id = mqp->raw_packet_qp.rq.tirn;
dest_type = MLX5_FLOW_DESTINATION_TYPE_TIR;
+   } else {
+   dest_type = MLX5_FLOW_DESTINATION_TYPE_PORT;
}
 
if (dev->rep)
@@ -109,8 +134,6 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
attrs, MLX5_IB_ATTR_CREATE_FLOW_MATCH_VALUE);
inlen = uverbs_attr_get_len(attrs,
MLX5_IB_ATTR_CREATE_FLOW_MATCH_VALUE);
-   fs_matcher = uverbs_attr_get_obj(attrs,
-MLX5_IB_ATTR_CREATE_FLOW_MATCHER);
 
len = uverbs_attr_get_uobjs_arr(attrs,

MLX5_IB_ATTR_CREATE_FLOW_ARR_FLOW_ACTIONS,
@@ -156,6 +179,7 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_MATCHER_CREATE)(
attrs, MLX5_IB_ATTR_FLOW_MATCHER_CREATE_HANDLE);
struct mlx5_ib_dev *dev = to_mdev(uobj->context->device);
struct mlx5_ib_flow_matcher *obj;
+   u32 flags;
int err;
 
obj = kzalloc(sizeof(struct mlx5_ib_flow_matcher), GFP_KERNEL);
@@ -188,6 +212,16 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_MATCHER_CREATE)(
if (err)
goto end;
 
+   err = uverbs_get_flags32(&flags, attrs,
+MLX5_IB_ATTR_FLOW_MATCHER_FLOW_FLAGS,
+IB_FLOW_ATTR_FLAGS_EGRESS);
+   if (!err && flags) {
+   err = mlx5_ib_ft_type_to_namespace(MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_TX,
+  &obj->ns_type);
+   if (err)
+   goto end;
+   }
+
uobj->object = obj;
obj->mdev = dev->mdev;
atomic_set(&obj->usecnt, 0);
@@ -198,22 +232,6 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_MATCHER_CREATE)(
return err;
 }
 
-static int mlx5_ib_ft_type_to_namespace(u8 table_type, u8 *namespace)
-{
-   switch (table_type) {
-   case MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_RX:
-   *n

[PATCH rdma-next 17/27] RDMA/mlx5: Add new flow action verb, packet reformat

2018-07-29 Thread Leon Romanovsky
From: Mark Bloch 

For now, only add the L2_TUNNEL_TO_L2 option; for example, this can be used
to decap VXLAN packets.

Signed-off-by: Mark Bloch 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/hw/mlx5/flow.c | 76 ++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  1 +
 include/uapi/rdma/mlx5_user_ioctl_cmds.h  |  7 +++
 include/uapi/rdma/mlx5_user_ioctl_verbs.h |  4 ++
 4 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
index d0325e468801..0fea98cb7d42 100644
--- a/drivers/infiniband/hw/mlx5/flow.c
+++ b/drivers/infiniband/hw/mlx5/flow.c
@@ -202,6 +202,8 @@ void mlx5_ib_destroy_flow_action_raw(struct mlx5_ib_flow_action *maction)
mlx5_modify_header_dealloc(maction->flow_action_raw.dev->mdev,
   maction->flow_action_raw.action_id);
break;
+   case MLX5_IB_FLOW_ACTION_DECAP:
+   break;
default:
WARN_ON(true);
break;
@@ -284,6 +286,64 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_ACTION_CREATE_MODIFY_HEADER)(struc
return 0;
 }
 
+static bool mlx5_ib_flow_action_packet_reformat_valid(struct mlx5_ib_dev *ibdev,
+ u8 packet_reformat_type,
+ u8 ft_type)
+{
+   switch (packet_reformat_type) {
+   case MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TUNNEL_TO_L2:
+   if (ft_type == MLX5_IB_UAPI_FLOW_TABLE_TYPE_NIC_RX)
+   return MLX5_CAP_FLOWTABLE_NIC_RX(ibdev->mdev, decap);
+   break;
+   default:
+   break;
+   }
+
+   return false;
+}
+
+static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_ACTION_CREATE_PACKET_REFORMAT)(struct ib_device *ib_dev,
+   struct ib_uverbs_file *file,
+   struct uverbs_attr_bundle *attrs)
+{
+   struct ib_uobject *uobj = uverbs_attr_get_uobject(attrs,
+  MLX5_IB_ATTR_CREATE_PACKET_REFORMAT_HANDLE);
+   struct mlx5_ib_dev *mdev = to_mdev(uobj->context->device);
+   enum mlx5_ib_uapi_flow_action_packet_reformat_type dv_prt;
+   enum mlx5_ib_uapi_flow_table_type ft_type;
+   struct mlx5_ib_flow_action *maction;
+   int ret;
+
+   ret = uverbs_get_const(&ft_type, attrs,
+  MLX5_IB_ATTR_CREATE_PACKET_REFORMAT_FT_TYPE);
+   if (ret)
+   return -EINVAL;
+
+   ret = uverbs_get_const(&dv_prt, attrs,
+  MLX5_IB_ATTR_CREATE_PACKET_REFORMAT_TYPE);
+   if (ret)
+   return -EINVAL;
+
+   if (!mlx5_ib_flow_action_packet_reformat_valid(mdev, dv_prt, ft_type))
+   return -EOPNOTSUPP;
+
+   maction = kzalloc(sizeof(*maction), GFP_KERNEL);
+   if (!maction)
+   return -ENOMEM;
+
+   if (dv_prt ==
+   MLX5_IB_UAPI_FLOW_ACTION_PACKET_REFORMAT_TYPE_L2_TUNNEL_TO_L2) {
+   maction->flow_action_raw.sub_type =
+   MLX5_IB_FLOW_ACTION_DECAP;
+   maction->flow_action_raw.dev = mdev;
+   }
+
+   uverbs_flow_action_fill_action(&maction->ib_action, uobj,
+  uobj->context->device,
+  IB_FLOW_ACTION_UNSPECIFIED);
+   return 0;
+}
+
 DECLARE_UVERBS_NAMED_METHOD(
MLX5_IB_METHOD_CREATE_FLOW,
UVERBS_ATTR_IDR(MLX5_IB_ATTR_CREATE_FLOW_HANDLE,
@@ -332,9 +392,23 @@ DECLARE_UVERBS_NAMED_METHOD(
 enum mlx5_ib_uapi_flow_table_type,
 UA_MANDATORY));
 
+DECLARE_UVERBS_NAMED_METHOD(
+   MLX5_IB_METHOD_FLOW_ACTION_CREATE_PACKET_REFORMAT,
+   UVERBS_ATTR_IDR(MLX5_IB_ATTR_CREATE_PACKET_REFORMAT_HANDLE,
+   UVERBS_OBJECT_FLOW_ACTION,
+   UVERBS_ACCESS_NEW,
+   UA_MANDATORY),
+   UVERBS_ATTR_CONST_IN(MLX5_IB_ATTR_CREATE_PACKET_REFORMAT_TYPE,
+enum mlx5_ib_uapi_flow_action_packet_reformat_type,
+UA_MANDATORY),
+   UVERBS_ATTR_CONST_IN(MLX5_IB_ATTR_CREATE_PACKET_REFORMAT_FT_TYPE,
+enum mlx5_ib_uapi_flow_table_type,
+UA_MANDATORY));
+
 ADD_UVERBS_METHODS(mlx5_ib_flow_actions,
   UVERBS_OBJECT_FLOW_ACTION,
-  &UVERBS_METHOD(MLX5_IB_METHOD_FLOW_ACTION_CREATE_MODIFY_HEADER));
+  &UVERBS_METHOD(MLX5_IB_METHOD_FLOW_ACTION_CREATE_MODIFY_HEADER),
+  &UVERBS_METHOD(MLX5_IB_METHOD_FLOW_ACTION_CREATE_PACKET_REFORMAT));
 
 DECLARE_UVERBS_NAMED_METHOD(
MLX5_IB_METHOD_FLOW_MATCHER_CREATE,
diff --git a/drivers/infiniband/


Re: [PATCH 2/4] net: dsa: Add Lantiq / Intel GSWIP tag support

2018-07-29 Thread Hauke Mehrtens
On 07/25/2018 04:20 PM, Andrew Lunn wrote:
> On Sat, Jul 21, 2018 at 09:13:56PM +0200, Hauke Mehrtens wrote:
>> This handles the tag added by the PMAC on the VRX200 SoC line.
>>
>> Internally, the GSWIP uses a GSWIP special tag which is located after the
>> Ethernet header. The PMAC which connects the GSWIP to the CPU converts
>> this special tag used by the GSWIP into the PMAC special tag which is
>> added in front of the Ethernet header.
>>
>> This was tested with GSWIP 2.0 found in the VRX200 SoCs; other GSWIP
>> versions use slightly different PMAC special tags.
>>
>> Signed-off-by: Hauke Mehrtens 
> 
> Hi Hauke
> 
> This looks good. A few minor nitpicks below.
> 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "dsa_priv.h"
>> +
>> +
>> +#define GSWIP_TX_HEADER_LEN 4
> 
> Single newline is sufficient.

removed
> 
>> +/* Byte 3 */
>> +#define GSWIP_TX_CRCGEN_DIS BIT(23)
> 
> BIT(23) in a byte is a bit odd.

OK, this should be BIT(7).
The ordering of these defines was also strange; I fixed that.

> 
>> +#define GSWIP_TX_SLPID_SHIFT0   /* source port ID */
>> +#define  GSWIP_TX_SLPID_CPU 2
>> +#define  GSWIP_TX_SLPID_APP13
>> +#define  GSWIP_TX_SLPID_APP24
>> +#define  GSWIP_TX_SLPID_APP35
>> +#define  GSWIP_TX_SLPID_APP46
>> +#define  GSWIP_TX_SLPID_APP57
>> +
>> +
>> +#define GSWIP_RX_HEADER_LEN 8
> 
> Single newline is sufficient. Please fix them all, if there are more
> of them.

ok

>> +
>> +/* special tag in RX path header */
>> +/* Byte 7 */
>> +#define GSWIP_RX_SPPID_SHIFT4
>> +#define GSWIP_RX_SPPID_MASK GENMASK(6, 4)
>> +
>> +static struct sk_buff *gswip_tag_rcv(struct sk_buff *skb,
>> + struct net_device *dev,
>> + struct packet_type *pt)
>> +{
>> +int port;
>> +u8 *gswip_tag;
>> +
>> +if (unlikely(!pskb_may_pull(skb, GSWIP_RX_HEADER_LEN)))
>> +return NULL;
>> +
>> +gswip_tag = ((u8 *)skb->data) - ETH_HLEN;
> 
> The cast should not be needed, data already is an unsigned char.

OK, I removed that.

>> +skb_pull_rcsum(skb, GSWIP_RX_HEADER_LEN);
>> +
>> +/* Get source port information */
>> +port = (gswip_tag[7] & GSWIP_RX_SPPID_MASK) >> GSWIP_RX_SPPID_SHIFT;
>> +skb->dev = dsa_master_find_slave(dev, 0, port);
>> +if (!skb->dev)
>> +return NULL;
>> +
>> +return skb;
>> +}
> 
>   Andrew
> 



Re: [PATCH 3/4] net: lantiq: Add Lantiq / Intel vrx200 Ethernet driver

2018-07-29 Thread Hauke Mehrtens
On 07/25/2018 05:28 PM, Andrew Lunn wrote:
>> +/* Make sure the firmware of the embedded GPHY is loaded before,
>> + * otherwise they will not be detectable on the MDIO bus.
>> + */
>> +of_for_each_phandle(&it, err, np, "lantiq,phys", NULL, 0) {
>> +phy_np = it.node;
>> +if (phy_np) {
>> +struct platform_device *phy = 
>> of_find_device_by_node(phy_np);
>> +
>> +of_node_put(phy_np);
>> +if (!platform_get_drvdata(phy))
>> +return -EPROBE_DEFER;
>> +}
>> +}
> 
> Is there a device tree binding document for this somewhere?
> 
>Andrew
> 

No, but I will create one.

I am also not sure if this is the correct way of doing this.

We first have to load the FW into the Ethernet PHY through some generic
SoC registers and then we can find it normally on the MDIO bus and
interact with it like an external PHY on the MDIO bus.

Hauke



Re: pull-request: can-next 2018-01-16,pull-request: can-next 2018-01-16

2018-07-29 Thread David Miller
From: Marc Kleine-Budde 
Date: Fri, 27 Jul 2018 11:40:10 +0200

> this is a pull request for net-next/master consisting of 38 patches.
 ...
> The following changes since commit ecbcd689d74a394b711d2360aef7e5d007ec9d98:
> 
>   Merge tag 'mlx5e-updates-2018-07-26' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux (2018-07-26 
> 21:33:24 -0700)
> 
> are available in the Git repository at:
> 
>   
> ssh://g...@gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git
>  tags/linux-can-next-for-4.19-20180727

Pulled, thanks Marc.


Re: [PATCH bpf] bpf: fix bpf_skb_load_bytes_relative pkt length check

2018-07-29 Thread Alexei Starovoitov
On Sat, Jul 28, 2018 at 10:04 PM, Daniel Borkmann  wrote:
> The len > skb_headlen(skb) cannot be used as a maximum upper bound
> for the packet length since it does not have any relation to the full
> linear packet length when filtering is used from upper layers (e.g.
> in case of reuseport BPF programs) as by then skb->data, skb->len
> already got mangled through __skb_pull() and others.
>
> Fixes: 4e1ec56cdc59 ("bpf: add skb_load_bytes_relative helper")
> Signed-off-by: Daniel Borkmann 
> Acked-by: Martin KaFai Lau 

Great catch.
Acked-by: Alexei Starovoitov 


Re: [PATCH 3/4] net: lantiq: Add Lantiq / Intel vrx200 Ethernet driver

2018-07-29 Thread Andrew Lunn
On Sun, Jul 29, 2018 at 04:03:10PM +0200, Hauke Mehrtens wrote:
> On 07/25/2018 05:28 PM, Andrew Lunn wrote:
> >> +  /* Make sure the firmware of the embedded GPHY is loaded before,
> >> +   * otherwise they will not be detectable on the MDIO bus.
> >> +   */
> >> +  of_for_each_phandle(&it, err, np, "lantiq,phys", NULL, 0) {
> >> +  phy_np = it.node;
> >> +  if (phy_np) {
> >> +  struct platform_device *phy = 
> >> of_find_device_by_node(phy_np);
> >> +
> >> +  of_node_put(phy_np);
> >> +  if (!platform_get_drvdata(phy))
> >> +  return -EPROBE_DEFER;
> >> +  }
> >> +  }
> > 
> > Is there a device tree binding document for this somewhere?
> > 
> >Andrew
> > 
> 
> No, but I will create one.
> 
> I am also not sure if this is the correct way of doing this.
> 
> We first have to load the FW into the Ethernet PHY through some generic
> SoC registers and then we can find it normally on the MDIO bus and
> interact with it like an external PHY on the MDIO bus.

Hi Hauke

It looks sensible so far, but it would be good to post the PHY firmware
download code as well. Let's see the big picture, then we can decide if
there is a better way.

Andrew


Re: [PATCH 3/4] net: lantiq: Add Lantiq / Intel vrx200 Ethernet driver

2018-07-29 Thread Hauke Mehrtens
On 07/29/2018 05:51 PM, Andrew Lunn wrote:
> On Sun, Jul 29, 2018 at 04:03:10PM +0200, Hauke Mehrtens wrote:
>> On 07/25/2018 05:28 PM, Andrew Lunn wrote:
 +  /* Make sure the firmware of the embedded GPHY is loaded before,
 +   * otherwise they will not be detectable on the MDIO bus.
 +   */
 +  of_for_each_phandle(&it, err, np, "lantiq,phys", NULL, 0) {
 +  phy_np = it.node;
 +  if (phy_np) {
 +  struct platform_device *phy = 
 of_find_device_by_node(phy_np);
 +
 +  of_node_put(phy_np);
 +  if (!platform_get_drvdata(phy))
 +  return -EPROBE_DEFER;
 +  }
 +  }
>>>
>>> Is there a device tree binding document for this somewhere?
>>>
>>>Andrew
>>>
>>
>> No, but I will create one.
>>
>> I am also not sure if this is the correct way of doing this.
>>
>> We first have to load the FW into the Ethernet PHY through some generic
>> SoC registers and then we can find it normally on the MDIO bus and
>> interact with it like an external PHY on the MDIO bus.
> 
> Hi Hauke
> 
> It looks sensible so far, but it would be good to post the PHY firmware
> download code as well. Let's see the big picture, then we can decide if
> there is a better way.

Hi Andrew,

It is already in the kernel tree and can be found here:
https://elixir.bootlin.com/linux/v4.18-rc6/source/drivers/soc/lantiq/gphy.c

I am thinking about merging this into the switch driver, then we do not
have to configure the dependency any more.

Hauke


Re: [PATCH] bpf: verifier: BPF_MOV don't mark dst reg if src == dst

2018-07-29 Thread Alexei Starovoitov
On Thu, Jul 26, 2018 at 1:08 AM, Arthur Fabre  wrote:
> When check_alu_op() handles a BPF_MOV between two registers,
> it calls check_reg_arg() on the dst register, marking it as unbounded.
> If the src and dst register are the same, this marks the src as
> unbounded, which can lead to unexpected errors for further checks that
> rely on bounds info.
>
> check_alu_op() now only marks the dst register as unbounded if it
> different from the src register.
>
> Signed-off-by: Arthur Fabre 
> ---
>  kernel/bpf/verifier.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 63aaac52a265..ddfe3c544a80 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3238,8 +3238,9 @@ static int check_alu_op(struct bpf_verifier_env
> *env, struct bpf_insn *insn)
> }
> }
>
> -   /* check dest operand */
> -   err = check_reg_arg(env, insn->dst_reg, DST_OP);
> +   /* check dest operand, only mark if dest != src */
> +   err = check_reg_arg(env, insn->dst_reg,
> +   insn->dst_reg == insn->src_reg ?
> DST_OP_NO_MARK : DST_OP);

that doesn't look correct for 32-bit mov.
Is that the case you're trying to improve?


Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-29 Thread David Miller
From: Caleb Raitto 
Date: Mon, 23 Jul 2018 16:11:19 -0700

> From: Caleb Raitto 
> 
> The driver disables tx napi if it's not certain that completions will
> be processed affine with tx service.
> 
> Its heuristic doesn't account for some scenarios where it is, such as
> when the queue pair count matches the core count but not the
> hyperthread count.
> 
> Allow userspace to override the heuristic. This is an alternative
> solution to that in the linked patch. That added more logic in the
> kernel for these cases, but the agreement was that this was better left
> to user control.
> 
> Do not expand the existing napi_tx variable to a ternary value,
> because doing so can break user applications that expect
> boolean ('Y'/'N') instead of integer output. Add a new param instead.
> 
> Link: https://patchwork.ozlabs.org/patch/725249/
> Acked-by: Willem de Bruijn 
> Acked-by: Jon Olson 
> Signed-off-by: Caleb Raitto 

So I looked into the history surrounding these issues.

First of all, it always ends up turning out crummy when drivers start
to set affinities themselves.  The worst possible case is to do it
_conditionally_, and that is exactly what virtio_net is doing.

From the user's perspective, this provides a really bad experience.

So if I have a 32-queue device and there are 32 cpus, you'll do all
the affinity settings, stopping Irqbalanced from doing anything
right?

So if I add one more cpu, you'll say "oops, no idea what to do in
this situation" and not touch the affinities at all?

That makes no sense at all.

If the driver is going to set affinities at all, OWN that decision
and set it all the time to something reasonable.

Or accept that you shouldn't be touching this stuff in the first place
and leave the affinities alone.

Right now we're kinda in a situation where the driver has been setting
affinities in the ncpus==nqueues cases for some time, so we can't stop
doing it.

Which means we have to set them in all cases to make the user
experience sane again.

I looked at the linked to patch again:

https://patchwork.ozlabs.org/patch/725249/

And I think the strategy should be made more generic, to get rid of
the hyperthreading assumptions.  I also agree that the "assign
to first N cpus" logic doesn't make much sense either.

Just distribute across the available cpus evenly, and be done with it.
If you have 64 cpus and 32 queues, this assigns queues to every other
cpu.

Then we don't need this weird new module parameter.


Re: [PATCH 4/4] net: dsa: Add Lantiq / Intel DSA driver for vrx200

2018-07-29 Thread Hauke Mehrtens
On 07/25/2018 06:12 PM, Andrew Lunn wrote:
>>  LANTIQ MIPS ARCHITECTURE
>>  M:  John Crispin 
>> diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
>> index 2b81b97e994f..f1280aa3f9bd 100644
>> --- a/drivers/net/dsa/Kconfig
>> +++ b/drivers/net/dsa/Kconfig
>> @@ -23,6 +23,14 @@ config NET_DSA_LOOP
>>This enables support for a fake mock-up switch chip which
>>exercises the DSA APIs.
>>  
>> +config NET_DSA_GSWIP
>> +tristate "Intel / Lantiq GSWIP"
> 
> Minor nit pick. Could you make this NET_DSA_LANTIQ_GSWIP. We generally
> have some manufacture ID in the name. And change the text to Lantiq /
> Intel GSWIP.

done

>> +static const struct gswip_rmon_cnt_desc gswip_rmon_cnt[] = {
>> +/** Receive Packet Count (only packets that are accepted and not 
>> discarded). */
>> +MIB_DESC(1, 0x1F, "RxGoodPkts"),
>> +/** Receive Unicast Packet Count. */
>> +MIB_DESC(1, 0x23, "RxUnicastPkts"),
>> +/** Receive Multicast Packet Count. */
>> +MIB_DESC(1, 0x22, "RxMulticastPkts"),
>> +/** Receive FCS Error Packet Count. */
>> +MIB_DESC(1, 0x21, "RxFCSErrorPkts"),
>> +/** Receive Undersize Good Packet Count. */
>> +MIB_DESC(1, 0x1D, "RxUnderSizeGoodPkts"),
>> +/** Receive Undersize Error Packet Count. */
>> +MIB_DESC(1, 0x1E, "RxUnderSizeErrorPkts"),
>> +/** Receive Oversize Good Packet Count. */
>> +MIB_DESC(1, 0x1B, "RxOversizeGoodPkts"),
>> +/** Receive Oversize Error Packet Count. */
>> +MIB_DESC(1, 0x1C, "RxOversizeErrorPkts"),
>> +/** Receive Good Pause Packet Count. */
>> +MIB_DESC(1, 0x20, "RxGoodPausePkts"),
>> +/** Receive Align Error Packet Count. */
>> +MIB_DESC(1, 0x1A, "RxAlignErrorPkts"),
>> +/** Receive Size 64 Packet Count. */
>> +MIB_DESC(1, 0x12, "Rx64BytePkts"),
>> +/** Receive Size 65-127 Packet Count. */
>> +MIB_DESC(1, 0x13, "Rx127BytePkts"),
>> +/** Receive Size 128-255 Packet Count. */
>> +MIB_DESC(1, 0x14, "Rx255BytePkts"),
>> +/** Receive Size 256-511 Packet Count. */
>> +MIB_DESC(1, 0x15, "Rx511BytePkts"),
>> +/** Receive Size 512-1023 Packet Count. */
>> +MIB_DESC(1, 0x16, "Rx1023BytePkts"),
>> +/** Receive Size 1024-1522 (or more, if configured) Packet Count. */
>> +MIB_DESC(1, 0x17, "RxMaxBytePkts"),
>> +/** Receive Dropped Packet Count. */
>> +MIB_DESC(1, 0x18, "RxDroppedPkts"),
>> +/** Filtered Packet Count. */
>> +MIB_DESC(1, 0x19, "RxFilteredPkts"),
>> +/** Receive Good Byte Count (64 bit). */
>> +MIB_DESC(2, 0x24, "RxGoodBytes"),
>> +/** Receive Bad Byte Count (64 bit). */
>> +MIB_DESC(2, 0x26, "RxBadBytes"),
>> +/** Transmit Dropped Packet Count, based on Congestion Management. */
>> +MIB_DESC(1, 0x11, "TxAcmDroppedPkts"),
>> +/** Transmit Packet Count. */
>> +MIB_DESC(1, 0x0C, "TxGoodPkts"),
>> +/** Transmit Unicast Packet Count. */
>> +MIB_DESC(1, 0x06, "TxUnicastPkts"),
>> +/** Transmit Multicast Packet Count. */
>> +MIB_DESC(1, 0x07, "TxMulticastPkts"),
>> +/** Transmit Size 64 Packet Count. */
>> +MIB_DESC(1, 0x00, "Tx64BytePkts"),
>> +/** Transmit Size 65-127 Packet Count. */
>> +MIB_DESC(1, 0x01, "Tx127BytePkts"),
>> +/** Transmit Size 128-255 Packet Count. */
>> +MIB_DESC(1, 0x02, "Tx255BytePkts"),
>> +/** Transmit Size 256-511 Packet Count. */
>> +MIB_DESC(1, 0x03, "Tx511BytePkts"),
>> +/** Transmit Size 512-1023 Packet Count. */
>> +MIB_DESC(1, 0x04, "Tx1023BytePkts"),
>> +/** Transmit Size 1024-1522 (or more, if configured) Packet Count. */
>> +MIB_DESC(1, 0x05, "TxMaxBytePkts"),
>> +/** Transmit Single Collision Count. */
>> +MIB_DESC(1, 0x08, "TxSingleCollCount"),
>> +/** Transmit Multiple Collision Count. */
>> +MIB_DESC(1, 0x09, "TxMultCollCount"),
>> +/** Transmit Late Collision Count. */
>> +MIB_DESC(1, 0x0A, "TxLateCollCount"),
>> +/** Transmit Excessive Collision Count. */
>> +MIB_DESC(1, 0x0B, "TxExcessCollCount"),
>> +/** Transmit Pause Packet Count. */
>> +MIB_DESC(1, 0x0D, "TxPauseCount"),
>> +/** Transmit Drop Packet Count. */
>> +MIB_DESC(1, 0x10, "TxDroppedPkts"),
>> +/** Transmit Good Byte Count (64 bit). */
>> +MIB_DESC(2, 0x0E, "TxGoodBytes"),
> 
> Most of the comments here don't add anything useful. Maybe remove
> them?

Ok I removed them. Are the names ok, or should they follow any Linux
definition?

>> +};
>> +
>> +static u32 gswip_switch_r(struct gswip_priv *priv, u32 offset)
>> +{
>> +return __raw_readl(priv->gswip + (offset * 4));
>> +}
>> +
>> +static void gswip_switch_w(struct gswip_priv *priv, u32 val, u32 offset)
>> +{
>> +return __raw_writel(val, priv->gswip + (offset * 4));
>> +}
> 
> Since this is MIPS, i assume re-ordering cannot happen, there are
> barriers, etc?

As far as I know this is not a problem on this bus and no barriers are
needed here.

>> +static int xrx200_mdio_poll(struct gswip_priv *priv)
>> +{
>> +   

Re: [PATCH 3/4] net: lantiq: Add Lantiq / Intel vrx200 Ethernet driver

2018-07-29 Thread Andrew Lunn
> I am thinking about merging this into the switch driver, then we do not
> have to configure the dependency any more.

Hi Hauke

Are there any PHYs which are not part of the switch?

Making it part of the switch driver would make sense. Are there any
backwards compatibility issues? I don't actually see any boards in
mainline using the compatible strings.

Another option would be to write an independent mdio driver, and make
firmware download part of that. That gives the advantage of supporting
PHYs which are not part of the switch.

 Andrew


Re: [PATCH 4/4] net: dsa: Add Lantiq / Intel DSA driver for vrx200

2018-07-29 Thread Andrew Lunn
> >> +static const struct gswip_rmon_cnt_desc gswip_rmon_cnt[] = {
> >> +  /** Receive Packet Count (only packets that are accepted and not 
> >> discarded). */
> >> +  MIB_DESC(1, 0x1F, "RxGoodPkts"),
> >> +  /** Receive Size 1024-1522 (or more, if configured) Packet Count. */
> >> +  MIB_DESC(1, 0x17, "RxMaxBytePkts"),
> >> +  /** Transmit Size 1024-1522 (or more, if configured) Packet Count. */
> >> +  MIB_DESC(1, 0x05, "TxMaxBytePkts"),
> > 
> > Most of the comments here don't add anything useful. Maybe remove
> > them?
> 
> Ok I removed them.

The comments i left above are useful, since they give additional
information which is not obvious from the name.

> Are the names ok, or should they follow any Linux definition?

There are no standard names. So each driver tends to be different.

> > Please return ETIMEDOUT when needed. Maybe use one of the variants of
> > readx_poll_timeout().
> 
> I am returning ETIMEDOUT now.
> 
> If I used readx_poll_timeout() I could not use the gswip_mdio_r()
> function, because it takes two arguments; I would have to use readl
> directly.

Yes, they don't always fit, which is why is said "maybe".

> > The names make this unclear. The callback is used to configure the MAC
> > layer when something happens at the PHY layer. phyaddr does not appear
> > to be an address, not should it be doing anything to a PHY.
> 
> I renamed this to phyconf, as this contains multiple configuration
> values. This tells the MAC what settings the PHY wants to use.

macconf might be better, since this is configuring the MAC, not the
PHY.

> This is sort of a firmware, but it is also in the GPL driver.
> Currently the probe function is not marked __init so we can not make
> this easily __initdata.
> It has 64 entries of 8 bytes each, so 512 bytes; I think we can put this
> into the code.

512 bytes is fine.

Andrew


Re: [PATCH bpf] tools/bpftool: fix a percpu_array map dump problem

2018-07-29 Thread Yonghong Song

On 7/28/18 12:14 PM, Daniel Borkmann wrote:

On 07/28/2018 01:11 AM, Yonghong Song wrote:

I hit the following problem when I tried to use bpftool
to dump a percpu array.

   $ sudo ./bpftool map show
   61: percpu_array  name stub  flags 0x0
  key 4B  value 4B  max_entries 1  memlock 4096B
   ...
   $ sudo ./bpftool map dump id 61
   bpftool: malloc.c:2406: sysmalloc: Assertion
   `(old_top == initial_top (av) && old_size == 0) || \
((unsigned long) (old_size) >= MINSIZE && \
prev_inuse (old_top) && \
((unsigned long) old_end & (pagesize - 1)) == 0)'
   failed.
   Aborted

Further debugging revealed that this is due to
miscommunication between bpftool and kernel.
For example, for the above percpu_array with value size of 4B.
The map info returned to user space has value size of 4B.

In bpftool, the values array for lookup is allocated like:
info->value_size * get_possible_cpus() = 4 * get_possible_cpus()
In kernel (kernel/bpf/syscall.c), the values array size is
rounded up to multiple of 8.
round_up(map->value_size, 8) * num_possible_cpus()
= 8 * num_possible_cpus()
So when kernel copies the values to user buffer, the kernel will
overwrite beyond user buffer boundary.

This patch fixed the issue by allocating and stepping through
percpu map value array properly in bpftool.

Fixes: 71bb428fe2c19 ("tools: bpf: add bpftool")
Signed-off-by: Yonghong Song 
---
  tools/bpf/bpftool/map.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 0ee3ba479d87..92bc55f98c4c 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -35,6 +35,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -91,7 +92,8 @@ static bool map_is_map_of_progs(__u32 type)
  static void *alloc_value(struct bpf_map_info *info)
  {
if (map_is_per_cpu(info->type))
-   return malloc(info->value_size * get_possible_cpus());
+   return malloc(round_up(info->value_size, 8) *
+ get_possible_cpus());
else
return malloc(info->value_size);
  }
@@ -273,9 +275,10 @@ static void print_entry_json(struct bpf_map_info *info, unsigned char *key,
do_dump_btf(&d, info, key, value);
}
} else {
-   unsigned int i, n;
+   unsigned int i, n, step;
  
  		n = get_possible_cpus();

+   step = round_up(info->value_size, 8);
  
  		jsonw_name(json_wtr, "key");

print_hex_data_json(key, info->key_size);
@@ -288,7 +291,7 @@ static void print_entry_json(struct bpf_map_info *info, unsigned char *key,
jsonw_int_field(json_wtr, "cpu", i);
  
  			jsonw_name(json_wtr, "value");

-   print_hex_data_json(value + i * info->value_size,
+   print_hex_data_json(value + i * step,
info->value_size);
  
  			jsonw_end_object(json_wtr);


Fix looks correct to me, but you would also need the same fix in 
print_entry_plain(), no?


Thanks for pointing this out. Will submit v2 soon to fix the issue.

Thanks,
Daniel



[PATCH bpf v2] tools/bpftool: fix a percpu_array map dump problem

2018-07-29 Thread Yonghong Song
I hit the following problem when I tried to use bpftool
to dump a percpu array.

  $ sudo ./bpftool map show
  61: percpu_array  name stub  flags 0x0
  key 4B  value 4B  max_entries 1  memlock 4096B
  ...
  $ sudo ./bpftool map dump id 61
  bpftool: malloc.c:2406: sysmalloc: Assertion
  `(old_top == initial_top (av) && old_size == 0) || \
   ((unsigned long) (old_size) >= MINSIZE && \
   prev_inuse (old_top) && \
   ((unsigned long) old_end & (pagesize - 1)) == 0)'
  failed.
  Aborted

Further debugging revealed that this is due to
miscommunication between bpftool and kernel.
For example, for the above percpu_array with value size of 4B.
The map info returned to user space has value size of 4B.

In bpftool, the values array for lookup is allocated like:
   info->value_size * get_possible_cpus() = 4 * get_possible_cpus()
In kernel (kernel/bpf/syscall.c), the values array size is
rounded up to multiple of 8.
   round_up(map->value_size, 8) * num_possible_cpus()
   = 8 * num_possible_cpus()
So when kernel copies the values to user buffer, the kernel will
overwrite beyond user buffer boundary.

This patch fixed the issue by allocating and stepping through
percpu map value array properly in bpftool.

Fixes: 71bb428fe2c19 ("tools: bpf: add bpftool")
Signed-off-by: Yonghong Song 
---
 tools/bpf/bpftool/map.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

Changelogs:
 v1 -> v2:
   . Added missing fix in function print_entry_plain().

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 0ee3ba479d87..0a63842e9cb4 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -91,7 +92,8 @@ static bool map_is_map_of_progs(__u32 type)
 static void *alloc_value(struct bpf_map_info *info)
 {
if (map_is_per_cpu(info->type))
-   return malloc(info->value_size * get_possible_cpus());
+   return malloc(round_up(info->value_size, 8) *
+ get_possible_cpus());
else
return malloc(info->value_size);
 }
@@ -273,9 +275,10 @@ static void print_entry_json(struct bpf_map_info *info, unsigned char *key,
do_dump_btf(&d, info, key, value);
}
} else {
-   unsigned int i, n;
+   unsigned int i, n, step;
 
n = get_possible_cpus();
+   step = round_up(info->value_size, 8);
 
jsonw_name(json_wtr, "key");
print_hex_data_json(key, info->key_size);
@@ -288,7 +291,7 @@ static void print_entry_json(struct bpf_map_info *info, unsigned char *key,
jsonw_int_field(json_wtr, "cpu", i);
 
jsonw_name(json_wtr, "value");
-   print_hex_data_json(value + i * info->value_size,
+   print_hex_data_json(value + i * step,
info->value_size);
 
jsonw_end_object(json_wtr);
@@ -319,9 +322,10 @@ static void print_entry_plain(struct bpf_map_info *info, unsigned char *key,
 
printf("\n");
} else {
-   unsigned int i, n;
+   unsigned int i, n, step;
 
n = get_possible_cpus();
+   step = round_up(info->value_size, 8);
 
printf("key:\n");
fprint_hex(stdout, key, info->key_size, " ");
@@ -329,7 +333,7 @@ static void print_entry_plain(struct bpf_map_info *info, unsigned char *key,
for (i = 0; i < n; i++) {
printf("value (CPU %02d):%c",
   i, info->value_size > 16 ? '\n' : ' ');
-   fprint_hex(stdout, value + i * info->value_size,
+   fprint_hex(stdout, value + i * step,
   info->value_size, " ");
printf("\n");
}
-- 
2.14.3



Re: [PATCH 3/4] net: lantiq: Add Lantiq / Intel vrx200 Ethernet driver

2018-07-29 Thread Hauke Mehrtens
On 07/29/2018 06:40 PM, Andrew Lunn wrote:
>> I am thinking about merging this into the switch driver, then we do not
>> have to configure the dependency any more.
> 
> Hi Hauke
> 
> Are there any PHYs which are not part of the switch?

The embedded PHYs are only connected to the switch in this SoC and on
all other SoCs from this line I am aware of.

> Making it part of the switch driver would make sense. Are there any
> backwards compatibility issues? I don't actually see any boards in
> mainline using the compatible strings.

There is currently no device tree file added for any board in mainline.
I would then prefer to add this to the switch driver.

I have to make sure the firmware gets loaded before we scan the MDIO
bus. When no FW is loaded, the PHYs do not get detected.

More recent SoCs have more embedded Ethernet PHYs, so I would like to
support a variable number of these PHYs.

The firmware is 64 KBytes in size and we have to load it into contiguous
memory which is then used by the PHY itself. When we are late in the
boot process we could run into memory problems, most devices have 64MB
or 128MB of RAM.

What should the device tree binding look like?

Should I create an extra sub node:

gswip: gswip@E108000 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "lantiq,xrx200-gswip";
reg = < 0xE108000 0x3000 /* switch */
0xE10B100 0x70 /* mdio */
0xE10B1D8 0x30 /* mii */
>;
dsa,member = <0 0>;

ports {
#address-cells = <1>;
#size-cells = <0>;

port@0 {
reg = <0>;
label = "lan3";
phy-mode = "rgmii";
phy-handle = <&phy0>;
};

};

mdio@0 {
#address-cells = <1>;
#size-cells = <0>;
compatible = "lantiq,xrx200-mdio";
reg = <0>;

phy0: ethernet-phy@0 {
reg = <0x0>;
};

};

/* this would be the new part */
phys {
gphy0: gphy@20 {
compatible = "lantiq,xrx200a2x-gphy";
reg = <0x20 0x4>;
rcu = <&rcu0>;

resets = <&reset0 31 30>, <&reset1 7 7>;
reset-names = "gphy", "gphy2";
clocks = <&pmu0 XRX200_PMU_GATE_GPHY>;
lantiq,gphy-mode = ;
};

};
};

> Another option would be to write an independent mdio driver, and make
> firmware download part of that. That gives the advantage of supporting
> PHYs which are not part of the switch.
> 
>  Andrew
> 





Re: [PATCH 3/4] net: lantiq: Add Lantiq / Intel vrx200 Ethernet driver

2018-07-29 Thread Andrew Lunn
> The embedded PHYs are only connected to the switch in this SoC and on
> all other SoCs from this line I am aware of.

Hi Hauke

O.K, then it makes sense to have it part of the switch driver.

> The firmware is 64 KBytes in size and we have to load it into contiguous
> memory which is then used by the PHY itself. When we are late in the
> boot process we could run into memory problems, most devices have 64MB
> or 128MB of RAM.

You might want to look at using CMA. I've never used it myself, so
cannot help much.

> What should the device tree binding look like?
> 
> Should I create an extra sub node:
> 
> gswip: gswip@E108000 {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "lantiq,xrx200-gswip";
>   reg = < 0xE108000 0x3000 /* switch */
>   0xE10B100 0x70 /* mdio */
>   0xE10B1D8 0x30 /* mii */
>   >;
>   dsa,member = <0 0>;
> 
>   ports {
>   #address-cells = <1>;
>   #size-cells = <0>;
> 
>   port@0 {
>   reg = <0>;
>   label = "lan3";
>   phy-mode = "rgmii";
>   phy-handle = <&phy0>;
>   };
>   
>   };
> 
>   mdio@0 {
>   #address-cells = <1>;
>   #size-cells = <0>;
>   compatible = "lantiq,xrx200-mdio";
>   reg = <0>;
> 
>   phy0: ethernet-phy@0 {
>   reg = <0x0>;
>   };
>   
>   };
> 
>   /* this would be the new part */
>   phys {
>   gphy0: gphy@20 {
>   compatible = "lantiq,xrx200a2x-gphy";

It would be good to make it clear this is for firmware download. So
scatter "firmware" or "fw" in some of these names. What we don't want
is a mix-up with PHYs within the mdio subtree. Otherwise this looks
good. But you should cross-post the device tree binding to the device
tree mailing list.

Andrew


Re: [net-next v1] net/ipv6: allow any source address for sendmsg pktinfo with ip_nonlocal_bind

2018-07-29 Thread David Miller
From: Vincent Bernat 
Date: Wed, 25 Jul 2018 13:19:13 +0200

> When the freebind feature is set on an IPv6 socket, any source address can
> be used when sending UDP datagrams using IPv6 PKTINFO ancillary
> message. Global non-local bind feature was added in commit
> 35a256fee52c ("ipv6: Nonlocal bind") for IPv6. This commit also allows
> IPv6 source address spoofing when non-local bind feature is enabled.
> 
> Signed-off-by: Vincent Bernat 

This definitely seems to make sense.  And is consistent with the other
tests involving freebind and transparent.

This test involving ip_nonlocal_bind, freebind, and transparent happens
in several locations.  Perhaps we should add a helper function for this?

Thanks.


Re: [PATCHv4 net-next 0/2] route: add support and selftests for directed broadcast forwarding

2018-07-29 Thread David Miller
From: Xin Long 
Date: Fri, 27 Jul 2018 16:37:27 +0800

> Patch 1/2 is the feature and 2/2 is the selftest. Check the changelog
> on each of them to know the details.
> 
> v1->v2:
>   - fix a typo in changelog.
>   - fix an uapi break that Davide noticed.
>   - flush route cache when bc_forwarding is changed.
>   - add the selftest for this patch as Ido's suggestion.
> 
> v2->v3:
>   - fix an incorrect 'if check' in devinet_conf_proc as David Ahern
> noticed.
>   - extend the selftest after one David Ahern fix for vrf.
> 
> v3->v4:
>   - improve the output log in the selftest as David Ahern suggested.

Series applied, thanks Xin.


Re: [PATCH net-next] net: dcb: add DSCP to comment about priority selector types

2018-07-29 Thread David Miller
From: Jakub Kicinski 
Date: Fri, 27 Jul 2018 13:11:00 -0700

> Commit ee2059819450 ("net/dcb: Add dscp to priority selector type")
> added a define for the new DSCP selector type created by
> IEEE 802.1Qcd, but missed the comment enumerating all selector types.
> Update the comment.
> 
> Signed-off-by: Jakub Kicinski 

Applied.


Re: [PATCH net-next 0/3] mtu related changes

2018-07-29 Thread David Miller
From: Stephen Hemminger 
Date: Fri, 27 Jul 2018 13:43:20 -0700

> While looking at other MTU issues, I noticed a couple of opportunities
> for improving the user experience.

Series applied, thanks.


Re: [PATCH net-next] selftests: mlxsw: qos_dscp_bridge: Fix

2018-07-29 Thread David Miller
From: Petr Machata 
Date: Sat, 28 Jul 2018 00:48:13 +0200

> There are two problems in this test case:
> 
> - When indexing in bash associative array, the subscript is interpreted as
>   string, not as a variable name to be expanded.
> 
> - The keys stored to t0s and t1s are not DSCP values, but priority +
>   base (i.e. the logical DSCP value, not the full bitfield value).
> 
> In combination these two bugs conspire to make the test just work,
> except it doesn't really test anything and always passes.
> 
> Fix the above two problems in obvious manner.
> 
> Signed-off-by: Petr Machata 

Applied, thanks.


Re: [PATCH 1/5] net: remove bogus RCU annotations on socket.wq

2018-07-29 Thread David Miller
From: Christoph Hellwig 
Date: Fri, 27 Jul 2018 16:02:10 +0200

> We never use RCU protection for it, just a lot of cargo-cult
> rcu_dereference_protects calls.
> 
> Note that we do keep the kfree_rcu call for it, as the references through
> struct sock are RCU protected and thus might require a grace period before
> freeing.
> 
> Signed-off-by: Christoph Hellwig 

These were added by Eric Dumazet and I would never accuse him of cargo
cult programming.

All of the rcu_dereference_protects() calls are legit, even though some
of them use '1' as the protects condition because in fact we know the
object is dead and gone through an RCU cycle at that point.

Let's skip this for now.  The rest of your series looks fine so why
don't you resubmit this series with just #2-#5?

Thanks.


Re: [pull request][net-next V2 00/12] Mellanox, mlx5 updates 2018-07-27 (Vxlan updates)

2018-07-29 Thread David Miller
From: Saeed Mahameed 
Date: Fri, 27 Jul 2018 17:06:10 -0700

> This series from Gal and Saeed provides updates to mlx5 vxlan implementation.
> 
> For more information please see tag log below.
> 
> Please pull and let me know if there's any problem.
> 
> V1->V2:
>  - Drop the rw lock patch.

Looks good, pulled, thank you!


Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-29 Thread Michael S. Tsirkin
On Sun, Jul 29, 2018 at 09:00:27AM -0700, David Miller wrote:
> From: Caleb Raitto 
> Date: Mon, 23 Jul 2018 16:11:19 -0700
> 
> > From: Caleb Raitto 
> > 
> > The driver disables tx napi if it's not certain that completions will
> > be processed affine with tx service.
> > 
> > Its heuristic doesn't account for some scenarios where it is, such as
> > when the queue pair count matches the core but not hyperthread count.
> > 
> > Allow userspace to override the heuristic. This is an alternative
> > solution to that in the linked patch. That added more logic in the
> > kernel for these cases, but the agreement was that this was better left
> > to user control.
> > 
> > Do not expand the existing napi_tx variable to a ternary value,
> > because doing so can break user applications that expect
> > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > 
> > Link: https://patchwork.ozlabs.org/patch/725249/
> > Acked-by: Willem de Bruijn 
> > Acked-by: Jon Olson 
> > Signed-off-by: Caleb Raitto 
> 
> So I looked into the history surrounding these issues.
> 
> First of all, it always ends up turning out crummy when drivers start
> to set affinities themselves.  The worst possible case is to do it
> _conditionally_, and that is exactly what virtio_net is doing.
> 
> From the user's perspective, this provides a really bad experience.
> 
> So if I have a 32-queue device and there are 32 cpus, you'll do all
> the affinity settings, stopping Irqbalanced from doing anything
> right?
> 
> So if I add one more cpu, you'll say "oops, no idea what to do in
> this situation" and not touch the affinities at all?
> 
> That makes no sense at all.
> 
> If the driver is going to set affinities at all, OWN that decision
> and set it all the time to something reasonable.
> 
> Or accept that you shouldn't be touching this stuff in the first place
> and leave the affinities alone.
> 
> Right now we're kinda in a situation where the driver has been setting
> affinities in the ncpus==nqueues cases for some time, so we can't stop
> doing it.
> 
> Which means we have to set them in all cases to make the user
> experience sane again.
> 
> I looked at the linked to patch again:
> 
>   https://patchwork.ozlabs.org/patch/725249/
> 
> And I think the strategy should be made more generic, to get rid of
> the hyperthreading assumptions.  I also agree that the "assign
> to first N cpus" logic doesn't make much sense either.
> 
> Just distribute across the available cpus evenly, and be done with it.
> If you have 64 cpus and 32 queues, this assigns queues to every other
> cpu.
> 
> Then we don't need this weird new module parameter.

Can't we set affinity to a set of CPUs?

The point really is that tx irq handler needs a lock on the tx queue to
free up skbs, so processing it on another CPU while tx is active causes
cache line bounces. So we want affinity to CPUs that submit to this
queue on the theory they have these cache line(s) anyway.

I suspect it's not a unique property of virtio.

-- 
MST


Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-29 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Sun, 29 Jul 2018 23:33:20 +0300

> The point really is that tx irq handler needs a lock on the tx queue to
> free up skbs, so processing it on another CPU while tx is active causes
> cache line bounces. So we want affinity to CPUs that submit to this
> queue on the theory they have these cache line(s) anyway.
> 
> I suspect it's not a unique property of virtio.

It certainly is not.

I think the objectives are clear, someone just needs to put them
together cleanly into a patch :)


Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-29 Thread Willem de Bruijn
On Sun, Jul 29, 2018 at 4:48 PM Michael S. Tsirkin  wrote:
>
> On Sun, Jul 29, 2018 at 09:00:27AM -0700, David Miller wrote:
> > From: Caleb Raitto 
> > Date: Mon, 23 Jul 2018 16:11:19 -0700
> >
> > > From: Caleb Raitto 
> > >
> > > The driver disables tx napi if it's not certain that completions will
> > > be processed affine with tx service.
> > >
> > > Its heuristic doesn't account for some scenarios where it is, such as
> > > when the queue pair count matches the core but not hyperthread count.
> > >
> > > Allow userspace to override the heuristic. This is an alternative
> > > solution to that in the linked patch. That added more logic in the
> > > kernel for these cases, but the agreement was that this was better left
> > > to user control.
> > >
> > > Do not expand the existing napi_tx variable to a ternary value,
> > > because doing so can break user applications that expect
> > > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> > >
> > > Link: https://patchwork.ozlabs.org/patch/725249/
> > > Acked-by: Willem de Bruijn 
> > > Acked-by: Jon Olson 
> > > Signed-off-by: Caleb Raitto 
> >
> > So I looked into the history surrounding these issues.
> >
> > First of all, it always ends up turning out crummy when drivers start
> > to set affinities themselves.  The worst possible case is to do it
> > _conditionally_, and that is exactly what virtio_net is doing.
> >
> > From the user's perspective, this provides a really bad experience.
> >
> > So if I have a 32-queue device and there are 32 cpus, you'll do all
> > the affinity settings, stopping Irqbalanced from doing anything
> > right?
> >
> > So if I add one more cpu, you'll say "oops, no idea what to do in
> > this situation" and not touch the affinities at all?
> >
> > That makes no sense at all.
> >
> > If the driver is going to set affinities at all, OWN that decision
> > and set it all the time to something reasonable.
> >
> > Or accept that you shouldn't be touching this stuff in the first place
> > and leave the affinities alone.
> >
> > Right now we're kinda in a situation where the driver has been setting
> > affinities in the ncpus==nqueues cases for some time, so we can't stop
> > doing it.
> >
> > Which means we have to set them in all cases to make the user
> > experience sane again.
> >
> > I looked at the linked to patch again:
> >
> >   https://patchwork.ozlabs.org/patch/725249/
> >
> > And I think the strategy should be made more generic, to get rid of
> > the hyperthreading assumptions.  I also agree that the "assign
> > to first N cpus" logic doesn't make much sense either.
> >
> > Just distribute across the available cpus evenly, and be done with it.
> > If you have 64 cpus and 32 queues, this assigns queues to every other
> > cpu.
> >
> > Then we don't need this weird new module parameter.
>
> Can't we set affinity to a set of CPUs?
>
> The point really is that tx irq handler needs a lock on the tx queue to
> free up skbs, so processing it on another CPU while tx is active causes
> cache line bounces. So we want affinity to CPUs that submit to this
> queue on the theory they have these cache line(s) anyway.
>
> I suspect it's not a unique property of virtio.

It is a good heuristic. But as Jon pointed out, there is a trade-off
with other costs, such as increased interrupt load with additional
tx queues. This is particularly stark when multiple hyperthreads
share a cache, but have independent tx queues.


Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-29 Thread Willem de Bruijn
On Sun, Jul 29, 2018 at 12:01 PM David Miller  wrote:
>
> From: Caleb Raitto 
> Date: Mon, 23 Jul 2018 16:11:19 -0700
>
> > From: Caleb Raitto 
> >
> > The driver disables tx napi if it's not certain that completions will
> > be processed affine with tx service.
> >
> > Its heuristic doesn't account for some scenarios where it is, such as
> > when the queue pair count matches the core but not hyperthread count.
> >
> > Allow userspace to override the heuristic. This is an alternative
> > solution to that in the linked patch. That added more logic in the
> > kernel for these cases, but the agreement was that this was better left
> > to user control.
> >
> > Do not expand the existing napi_tx variable to a ternary value,
> > because doing so can break user applications that expect
> > boolean ('Y'/'N') instead of integer output. Add a new param instead.
> >
> > Link: https://patchwork.ozlabs.org/patch/725249/
> > Acked-by: Willem de Bruijn 
> > Acked-by: Jon Olson 
> > Signed-off-by: Caleb Raitto 
>
> So I looked into the history surrounding these issues.
>
> First of all, it always ends up turning out crummy when drivers start
> to set affinities themselves.  The worst possible case is to do it
> _conditionally_, and that is exactly what virtio_net is doing.
>
> From the user's perspective, this provides a really bad experience.
>
> So if I have a 32-queue device and there are 32 cpus, you'll do all
> the affinity settings, stopping Irqbalanced from doing anything
> right?
>
> So if I add one more cpu, you'll say "oops, no idea what to do in
> this situation" and not touch the affinities at all?
>
> That makes no sense at all.
>
> If the driver is going to set affinities at all, OWN that decision
> and set it all the time to something reasonable.
>
> Or accept that you shouldn't be touching this stuff in the first place
> and leave the affinities alone.
>
> Right now we're kinda in a situation where the driver has been setting
> affinities in the ncpus==nqueues cases for some time, so we can't stop
> doing it.
>
> Which means we have to set them in all cases to make the user
> experience sane again.
>
> I looked at the linked to patch again:
>
> https://patchwork.ozlabs.org/patch/725249/
>
> And I think the strategy should be made more generic, to get rid of
> the hyperthreading assumptions.  I also agree that the "assign
> to first N cpus" logic doesn't make much sense either.
>
> Just distribute across the available cpus evenly, and be done with it.

Sounds good to me.

> If you have 64 cpus and 32 queues, this assigns queues to every other
> cpu.

Striping half the number of queues as cores on a hyperthreaded system
with two logical cores per physical core will allocate all queues on
only half the physical cores.

But it is probably not safe to make any assumptions about the
virtual-to-physical core mapping anyway, which makes the simplest
strategy preferable.


Re: [net-next 10/16] net/mlx5: Support PCIe buffer congestion handling via Devlink

2018-07-29 Thread Alexander Duyck
On Sun, Jul 29, 2018 at 2:23 AM, Moshe Shemesh  wrote:
>
>
> On Sat, Jul 28, 2018 at 7:06 PM, Bjorn Helgaas  wrote:
>>
>> On Thu, Jul 26, 2018 at 07:00:20AM -0700, Alexander Duyck wrote:
>> > On Thu, Jul 26, 2018 at 12:14 AM, Jiri Pirko  wrote:
>> > > Thu, Jul 26, 2018 at 02:43:59AM CEST, jakub.kicin...@netronome.com
>> > > wrote:
>> > >>On Wed, 25 Jul 2018 08:23:26 -0700, Alexander Duyck wrote:
>> > >>> On Wed, Jul 25, 2018 at 5:31 AM, Eran Ben Elisha wrote:
>> > >>> > On 7/24/2018 10:51 PM, Jakub Kicinski wrote:
>> > >>>  The devlink params haven't been upstream even for a full cycle
>> > >>>  and
>> > >>>  already you guys are starting to use them to configure standard
>> > >>>  features like queuing.
>> > >>> >>>
>> > >>> >>> We developed the devlink params in order to support non-standard
>> > >>> >>> configuration only. And for non-standard, there are generic and
>> > >>> >>> vendor
>> > >>> >>> specific options.
>> > >>> >>
>> > >>> >> I thought it was developed for performing non-standard and
>> > >>> >> possibly
>> > >>> >> vendor specific configuration.  Look at DEVLINK_PARAM_GENERIC_*
>> > >>> >> for
>> > >>> >> examples of well justified generic options for which we have no
>> > >>> >> other API.  The vendor mlx4 options look fairly vendor specific
>> > >>> >> if you
>> > >>> >> ask me, too.
>> > >>> >>
>> > >>> >> Configuring queuing has an API.  The question is it acceptable to
>> > >>> >> enter
>> > >>> >> into the risky territory of controlling offloads via devlink
>> > >>> >> parameters
>> > >>> >> or would we rather make vendors take the time and effort to model
>> > >>> >> things to (a subset) of existing APIs.  The HW never fits the
>> > >>> >> APIs
>> > >>> >> perfectly.
>> > >>> >
>> > >>> > I understand what you meant here, I would like to highlight that
>> > >>> > this
>> > >>> > mechanism was not meant to handle SRIOV, Representors, etc.
>> > >>> > The vendor specific configuration suggested here is to handle a
>> > >>> > congestion
>> > >>> > state in Multi Host environment (which includes PF and multiple
>> > >>> > VFs per
>> > >>> > host), where one host is not aware to the other hosts, and each is
>> > >>> > running
>> > >>> > on its own pci/driver. It is a device working mode configuration.
>> > >>> >
>> > >>> > This  couldn't fit into any existing API, thus creating this
>> > >>> > vendor specific
>> > >>> > unique API is needed.
>> > >>>
>> > >>> If we are just going to start creating devlink interfaces in for
>> > >>> every
>> > >>> one-off option a device wants to add why did we even bother with
>> > >>> trying to prevent drivers from using sysfs? This just feels like we
>> > >>> are back to the same arguments we had back in the day with it.
>> > >>>
>> > >>> I feel like the bigger question here is if devlink is how we are
>> > >>> going
>> > >>> to deal with all PCIe related features going forward, or should we
>> > >>> start looking at creating a new interface/tool for PCI/PCIe related
>> > >>> features? My concern is that we have already had features such as
>> > >>> DMA
>> > >>> Coalescing that didn't really fit into anything and now we are
>> > >>> starting to see other things related to DMA and PCIe bus credits.
>> > >>> I'm
>> > >>> wondering if we shouldn't start looking at a tool/interface to
>> > >>> configure all the PCIe related features such as interrupts, error
>> > >>> reporting, DMA configuration, power management, etc. Maybe we could
>> > >>> even look at sharing it across subsystems and include things like
>> > >>> storage, graphics, and other subsystems in the conversation.
>> > >>
>> > >>Agreed, for actual PCIe configuration (i.e. not ECN marking) we do
>> > >> need
>> > >>to build up an API.  Sharing it across subsystems would be very cool!
>>
>> I read the thread (starting at [1], for anybody else coming in late)
>> and I see this has something to do with "configuring outbound PCIe
>> buffers", but I haven't seen the connection to PCIe protocol or
>> features, i.e., I can't connect this to anything in the PCIe spec.
>>
>> Can somebody help me understand how the PCI core is relevant?  If
>> there's some connection with a feature defined by PCIe, or if it
>> affects the PCIe transaction protocol somehow, I'm definitely
>> interested in this.  But if this only affects the data transferred
>> over PCIe, i.e., the data payloads of PCIe TLP packets, then I'm not
>> sure why the PCI core should care.
>>
>
>
> As you wrote, this is not a PCIe feature and does not affect the PCIe
> transaction protocol.
>
> Actually, due to a hardware limitation in the current device, we have
> enabled a workaround in hardware.
>
> This mode is proprietary and not relevant to other PCIe devices, thus it
> is set using a driver-specific parameter in devlink

Essentially what this feature is doing is communicating the need for
PCIe back-pressure to the network fabric. So as the buffers on the
device start to fill because the device isn't able to get back PCIe
credits fast enough it

Re: [PATCH PATCH net-next 10/18] 9p: fix whitespace issues

2018-07-29 Thread Dominique Martinet
(removed tons of Cc to make v9fs-developer@ happy - please do include
linux-ker...@vger.kernel.org next time though, I only saw this patch by
chance snooping at netdev)

Stephen Hemminger wrote on Tue, Jul 24, 2018:
> Remove trailing whitespace and blank lines at EOF
> 
> Signed-off-by: Stephen Hemminger 

LGTM. I'm not sure if someone will pick up the whole series, but as other
maintainers seem to have taken individual patches I've taken this one, as
there are potential conflicts with what I have (I'm planning to remove the
whole net/9p/util.c file).

> ---
>  net/9p/client.c   | 4 ++--
>  net/9p/trans_virtio.c | 2 +-
>  net/9p/util.c | 1 -
>  3 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 5c1343195292..ff02826e0407 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -341,7 +341,7 @@ struct p9_req_t *p9_tag_lookup(struct p9_client *c, u16 tag)
>* buffer to read the data into */
>   tag++;
>  
> - if(tag >= c->max_tag) 
> + if (tag >= c->max_tag)
>   return NULL;
>  
>   row = tag / P9_ROW_MAXTAG;
> @@ -1576,7 +1576,7 @@ p9_client_read(struct p9_fid *fid, u64 offset, struct iov_iter *to, int *err)
>   int count = iov_iter_count(to);
>   int rsize, non_zc = 0;
>   char *dataptr;
> - 
> +
>   rsize = fid->iounit;
>   if (!rsize || rsize > clnt->msize-P9_IOHDRSZ)
>   rsize = clnt->msize - P9_IOHDRSZ;
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 05006cbb3361..279b24488d79 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -446,7 +446,7 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
>   out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
> out_pages, out_nr_pages, offs, outlen);
>   }
> - 
> +
>   /*
>* Take care of in data
>* For example TREAD have 11.
> diff --git a/net/9p/util.c b/net/9p/util.c
> index 59f278e64f58..55ad98277e85 100644
> --- a/net/9p/util.c
> +++ b/net/9p/util.c
> @@ -138,4 +138,3 @@ int p9_idpool_check(int id, struct p9_idpool *p)
>   return idr_find(&p->pool, id) != NULL;
>  }
>  EXPORT_SYMBOL(p9_idpool_check);
> -

-- 
Dominique Martinet


[PATCH net-next v7] net/tls: Use socket data_ready callback on record availability

2018-07-29 Thread Vakul Garg
On receipt of a complete tls record, use socket's saved data_ready
callback instead of state_change callback. In function tls_queue(),
the TLS record is queued in encrypted state. But the decryption
happens inline when tls_sw_recvmsg() or tls_sw_splice_read() is invoked.
So it should be ok to notify the waiting context about the availability
of data as soon as we have collected a full TLS record. For new data
availability notification, sk_data_ready callback is more appropriate.
It points to sock_def_readable() which wakes up specifically for EPOLLIN
event. This is in contrast to the socket callback sk_state_change which
points to sock_def_wakeup() which issues a wakeup unconditionally
(without event mask).

Signed-off-by: Vakul Garg 
---
v6->v7: Improved the commit message to contain the detailed reasoning.
(The same analysis was shared on the mail list.)

 net/tls/tls_sw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 6deceb7c56ba..33838f11fafa 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1028,7 +1028,7 @@ static void tls_queue(struct strparser *strp, struct sk_buff *skb)
ctx->recv_pkt = skb;
strp_pause(strp);
 
-   strp->sk->sk_state_change(strp->sk);
+   ctx->saved_data_ready(strp->sk);
 }
 
 static void tls_data_ready(struct sock *sk)
-- 
2.13.6



Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-29 Thread Jason Wang




On 2018-07-25 08:17, Jon Olson wrote:

On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin  wrote:

On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:

On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin  wrote:

On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:

From the above linked patch, I understand that there are yet
other special cases in production, such as a hard cap on #tx queues to
32 regardless of number of vcpus.

I don't think upstream kernels have this limit - we can
now use vmalloc for higher number of queues.

Yes. that patch* mentioned it as a google compute engine imposed
limit. It is exactly such cloud provider imposed rules that I'm
concerned about working around in upstream drivers.

* for reference, I mean https://patchwork.ozlabs.org/patch/725249/

Yea. Why does GCE do it btw?

There are a few reasons for the limit, some historical, some current.

Historically we did this because of a kernel limit on the number of
TAP queues (in Montreal I thought this limit was 32). To my chagrin,
the limit upstream at the time we did it was actually eight. We had
increased the limit from eight to 32 internally, and it appears
upstream it has subsequently been increased to 256. We no longer
use TAP for networking, so that constraint no longer applies for us,
but when looking at removing/raising the limit we discovered no
workloads that clearly benefited from lifting it, and it also placed
more pressure on our virtual networking stack particularly on the Tx
side. We left it as-is.

In terms of current reasons there are really two. One is memory usage.
As you know, virtio-net uses rx/tx pairs, so there's an expectation
that the guest will have an Rx queue for every Tx queue. We run our
individual virtqueues fairly deep (4096 entries) to give guests a wide
time window for re-posting Rx buffers and avoiding starvation on
packet delivery. Filling an Rx vring with max-sized mergeable buffers
(4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
be up to 512MB of memory posted for network buffers. Scaling this to
the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
all of the Rx rings full would (in the large average Rx packet size
case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
of RAM available, but I don't believe we've observed a situation where
they would have benefited from having 2.5 gigs of buffers posted for
incoming network traffic :)


We can work to have async txq and rxq instead of pairs if there's a
strong requirement.




The second reason is interrupt related -- as I mentioned above, we
have found no workloads that clearly benefit from so many queues, but
we have found workloads that degrade. In particular workloads that do
a lot of small packet processing but which aren't extremely latency
sensitive can achieve higher PPS by taking fewer interrupts across
fewer VCPUs due to better batching (this also incurs higher latency,
but at the limit the "busy" cores end up suppressing most interrupts
and spending most of their cycles farming out work). Memcache is a
good example here, particularly if the latency targets for request
completion are in the ~milliseconds range (rather than the
microseconds we typically strive for with TCP_RR-style workloads).

All of that said, we haven't been forthcoming with data (and
unfortunately I don't have it handy in a useful form, otherwise I'd
simply post it here), so I understand the hesitation to simply run
with napi_tx across the board. As Willem said, this patch seemed like
the least disruptive way to allow us to continue down the road of
"universal" NAPI Tx and to hopefully get data across enough workloads
(with VMs small, large, and absurdly large :) to present a compelling
argument in one direction or another. As far as I know there aren't
currently any NAPI related ethtool commands (based on a quick perusal
of ethtool.h)


As I suggested before, maybe we can (ab)use tx-frames-irq.

Thanks


-- it seems like it would be fairly involved/heavyweight
to plumb one solely for this unless NAPI Tx is something many users
will want to tune (and for which other drivers would support tuning).

--
Jon Olson




Re: [net-next v1] net/ipv6: allow any source address for sendmsg pktinfo with ip_nonlocal_bind

2018-07-29 Thread Vincent Bernat
 ❦ 29 July 2018 12:28 -0700, David Miller :

>> When the freebind feature is set on an IPv6 socket, any source address can
>> be used when sending UDP datagrams using IPv6 PKTINFO ancillary
>> message. Global non-local bind feature was added in commit
>> 35a256fee52c ("ipv6: Nonlocal bind") for IPv6. This commit also allows
>> IPv6 source address spoofing when non-local bind feature is enabled.
>> 
>> Signed-off-by: Vincent Bernat 
>
> This definitely seems to make sense.  And is consistent with the other
> tests involving freebind and transparent.
>
> This test involving ip_nonlocal_bind, freebind, and transparent happens
> in several locations.  Perhaps we should add a helper function for
> this?

Yes, I can do that. Should I also include one for SCTP?
-- 
"Elves and Dragons!" I says to him.  "Cabbages and potatoes are better
for you and me."
-- J. R. R. Tolkien


RE: Security enhancement proposal for kernel TLS

2018-07-29 Thread Vakul Garg
Sorry for a delayed response.
Kindly see inline.

> -Original Message-
> From: Dave Watson [mailto:davejwat...@fb.com]
> Sent: Wednesday, July 25, 2018 9:30 PM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; Peter Doliwa ; Boris
> Pismenny 
> Subject: Re: Security enhancement proposal for kernel TLS
> 
> You would probably get more responses if you cc the relevant people.
> Comments inline
> 
> On 07/22/18 12:49 PM, Vakul Garg wrote:
> > The kernel based TLS record layer allows the user space world to use a
> > decoupled TLS implementation.
> > The applications need not be linked with the TLS stack.
> > The TLS handshake can be done by a TLS daemon on behalf of the
> > applications.
> >
> > Presently, as soon as the handshake process derives keys, it pushes the
> > negotiated keys to kernel TLS.
> > Thereafter the applications can directly read and write data on their
> > TCP socket (without having to use SSL APIs).
> >
> > With the current kernel TLS implementation, there is a security problem.
> > Since the kernel TLS socket does not have information about the state
> > of the handshake, it allows applications to receive data from the peer
> > TLS endpoint even when the handshake verification has not been
> > completed by the SSL daemon.
> > It is a security problem if applications can receive data when
> > verification of the handshake transcript is not completed (done with
> > processing the TLS FINISHED message).
> >
> > My proposal:
> > - Kernel TLS should maintain the state of the handshake (verified or
> > unverified). In the unverified state, data records should not be
> > allowed to pass through to the applications.
> >
> > - Add a new control interface through which the user space SSL stack
> > can tell the TLS socket that the handshake has been verified and DATA
> > records can flow. In the 'unverified' state, only control records
> > should be allowed to pass, and reception of a DATA record should pause
> > the receive-side record decryption.
> 
> It's not entirely clear how your TLS handshake daemon works - why is
> it necessary to set the keys in the kernel tls socket before the
> handshake is completed?

IIUC, with the upstream implementation of the TLS record layer in the
kernel, the decryption of the TLS FINISHED message happens in the kernel.
Therefore the keys are already being sent to the kernel TLS socket before
the handshake is completed.

> Or, why do you need to hand off the fd to the client program
> before the handshake is completed?
  
The fd is always owned by the client program.
The client program opens the socket, TCP bind/connects it, and then hands
it over to the SSL stack as a transport handle for exchanging handshake
messages. This is how it works today whether we use kernel TLS or not.
I do not propose to change it.

In my proposal, the applications poll their own TCP socket using
read/recvmsg etc. If they get a handshake record, they forward it to the
entity running the handshake agent. The handshake agent could be a Linux
daemon, or could run on a separate security processor like a 'Secure
Element' or, say, ARM TrustZone. The applications forward any handshake
message they get back from the handshake agent to the connected TCP
socket. Therefore, the applications act as forwarders of the handshake
messages between the peer TLS endpoint and the handshake agent.
The received data messages are absorbed by the applications themselves
(bypassing the SSL stack completely). Similarly, the applications
transmit data directly by writing on their socket.

> Waiting until after handshake solves both of these issues.
 
The security-sensitive check, 'wait for the handshake to finish
completely before accepting data', should not be the onus of the
application. We have enough examples from the past where application
programmers made mistakes in setting up TLS correctly. The idea is to
isolate TLS session setup from the applications.

> 
> I'm not aware of any tls libraries that send data before the finished message,
> is there any reason you need to support this?

Sending data records before sending the finished message is a protocol
error. A good TLS library never does that. But an attacker can exploit it
if applications can receive data records before the handshake is
finished. With the current kernel TLS, it is possible to do so.

Further, as per the TLS RFC it is ok to piggyback data records after the
finished handshake message. This is called early data. But then it is the
responsibility of the applications to first complete finished-message
processing before accepting the data records.

The proposal is to disallow the application world from seeing data
records before the handshake finishes.

> 
> >
> > - The handshake state should fallback to 'unverified' in case a control
> record is seen again by kernel TLS (e.g. in case of renegotiation, post
> handshake client auth etc).
> 
> Currently kernel tls sockets return an error unless you explicitly handle the
> control record for exactly this reason.

IIRC, any kind handshake message post hands