[PATCH net 1/2] vxlan: do not modify the shared tunnel info when PMTU triggers an ICMP reply

2021-03-25 Thread Antoine Tenart
When the interface is part of a bridge or an Open vSwitch port and a
packet exceeds a PMTU estimate, an ICMP reply is sent to the sender. When
using the external mode (collect metadata) the source and destination
addresses are reversed, so that Open vSwitch can match the packet
against an existing (reverse) flow.

But inverting the source and destination addresses in the shared
ip_tunnel_info makes subsequent packets of the flow use a wrong
destination address (packets are tunnelled to the device itself) if the
flow isn't updated; with Open vSwitch, that is the case until the flow
times out.

Fix this by uncloning the skb's ip_tunnel_info before inverting its
source and destination addresses, so that the modification is only made
for the PMTU packet, not the following ones.

Fixes: fc68c99577cc ("vxlan: Support for PMTU discovery on directly bridged links")
Tested-by: Eelco Chaudron 
Reviewed-by: Eelco Chaudron 
Signed-off-by: Antoine Tenart 
---
 drivers/net/vxlan.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 666dd201c3d5..53dbc67e8a34 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2725,12 +2725,17 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
goto tx_error;
} else if (err) {
if (info) {
+   struct ip_tunnel_info *unclone;
struct in_addr src, dst;
 
+   unclone = skb_tunnel_info_unclone(skb);
+   if (unlikely(!unclone))
+   goto tx_error;
+
src = remote_ip.sin.sin_addr;
dst = local_ip.sin.sin_addr;
-   info->key.u.ipv4.src = src.s_addr;
-   info->key.u.ipv4.dst = dst.s_addr;
+   unclone->key.u.ipv4.src = src.s_addr;
+   unclone->key.u.ipv4.dst = dst.s_addr;
}
vxlan_encap_bypass(skb, vxlan, vxlan, vni, false);
dst_release(ndst);
@@ -2781,12 +2786,17 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
goto tx_error;
} else if (err) {
if (info) {
+   struct ip_tunnel_info *unclone;
struct in6_addr src, dst;
 
+   unclone = skb_tunnel_info_unclone(skb);
+   if (unlikely(!unclone))
+   goto tx_error;
+
src = remote_ip.sin6.sin6_addr;
dst = local_ip.sin6.sin6_addr;
-   info->key.u.ipv6.src = src;
-   info->key.u.ipv6.dst = dst;
+   unclone->key.u.ipv6.src = src;
+   unclone->key.u.ipv6.dst = dst;
}
 
vxlan_encap_bypass(skb, vxlan, vxlan, vni, false);
-- 
2.30.2



[PATCH net 2/2] geneve: do not modify the shared tunnel info when PMTU triggers an ICMP reply

2021-03-25 Thread Antoine Tenart
When the interface is part of a bridge or an Open vSwitch port and a
packet exceeds a PMTU estimate, an ICMP reply is sent to the sender. When
using the external mode (collect metadata) the source and destination
addresses are reversed, so that Open vSwitch can match the packet
against an existing (reverse) flow.

But inverting the source and destination addresses in the shared
ip_tunnel_info makes subsequent packets of the flow use a wrong
destination address (packets are tunnelled to the device itself) if the
flow isn't updated; with Open vSwitch, that is the case until the flow
times out.

Fix this by uncloning the skb's ip_tunnel_info before inverting its
source and destination addresses, so that the modification is only made
for the PMTU packet, not the following ones.

Fixes: c1a800e88dbf ("geneve: Support for PMTU discovery on directly bridged links")
Tested-by: Eelco Chaudron 
Reviewed-by: Eelco Chaudron 
Signed-off-by: Antoine Tenart 
---
 drivers/net/geneve.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 4ac0373326ef..d5b1e48e0c09 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -908,8 +908,16 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
 
info = skb_tunnel_info(skb);
if (info) {
-   info->key.u.ipv4.dst = fl4.saddr;
-   info->key.u.ipv4.src = fl4.daddr;
+   struct ip_tunnel_info *unclone;
+
+   unclone = skb_tunnel_info_unclone(skb);
+   if (unlikely(!unclone)) {
+   dst_release(&rt->dst);
+   return -ENOMEM;
+   }
+
+   unclone->key.u.ipv4.dst = fl4.saddr;
+   unclone->key.u.ipv4.src = fl4.daddr;
}
 
if (!pskb_may_pull(skb, ETH_HLEN)) {
@@ -993,8 +1001,16 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
struct ip_tunnel_info *info = skb_tunnel_info(skb);
 
if (info) {
-   info->key.u.ipv6.dst = fl6.saddr;
-   info->key.u.ipv6.src = fl6.daddr;
+   struct ip_tunnel_info *unclone;
+
+   unclone = skb_tunnel_info_unclone(skb);
+   if (unlikely(!unclone)) {
+   dst_release(dst);
+   return -ENOMEM;
+   }
+
+   unclone->key.u.ipv6.dst = fl6.saddr;
+   unclone->key.u.ipv6.src = fl6.daddr;
}
 
if (!pskb_may_pull(skb, ETH_HLEN)) {
-- 
2.30.2



[PATCH net 0/2] net: do not modify the shared tunnel info when PMTU triggers an ICMP reply

2021-03-25 Thread Antoine Tenart
Hi,

The series fixes an issue where a shared ip_tunnel_info is modified when
PMTU triggers an ICMP reply in vxlan and geneve, making subsequent
packets in that flow use a wrong destination address if the flow isn't
updated. Detailed information is given in each of the two commits.

This was tested manually with OVS and I ran the PMTU selftests with
kmemleak enabled (all OK, none were skipped).

Thanks!
Antoine

Antoine Tenart (2):
  vxlan: do not modify the shared tunnel info when PMTU triggers an ICMP
reply
  geneve: do not modify the shared tunnel info when PMTU triggers an
ICMP reply

 drivers/net/geneve.c | 24 
 drivers/net/vxlan.c  | 18 ++
 2 files changed, 34 insertions(+), 8 deletions(-)

-- 
2.30.2



Re: [PATCH net-next] net-sysfs: remove possible sleep from an RCU read-side critical section

2021-03-22 Thread Antoine Tenart
Quoting Matthew Wilcox (2021-03-22 18:44:21)
> On Mon, Mar 22, 2021 at 06:41:30PM +0100, Antoine Tenart wrote:
> > Quoting Matthew Wilcox (2021-03-22 17:54:39)
> > > -   rcu_read_lock();
> > > -   dev_maps = rcu_dereference(dev->xps_maps[type]);
> > > +   dev_maps = READ_ONCE(dev->xps_maps[type]);
> > 
> > Couldn't dev_maps be freed between here and the read of dev_maps->nr_ids
> > as we're not in an RCU read-side critical section?
> 
> Oh, good point.  Never mind, then.
> 
> > My feeling is there is not much value in having a tricky allocation
> > logic for reads from xps_cpus and xps_rxqs. While we could come up with
> > something, returning -ENOMEM on memory pressure should be fine.
> 
> That's fine.  It's your code, and this is probably a small allocation
> anyway.

All right. Thanks for the suggestions anyway!

Antoine


Re: [PATCH net-next] net-sysfs: remove possible sleep from an RCU read-side critical section

2021-03-22 Thread Antoine Tenart
Quoting Antoine Tenart (2021-03-22 18:41:30)
> Quoting Matthew Wilcox (2021-03-22 17:54:39)
> > On Mon, Mar 22, 2021 at 04:43:29PM +0100, Antoine Tenart wrote:
> > > xps_queue_show is mostly made of an RCU read-side critical section and
> > > calls bitmap_zalloc with GFP_KERNEL in the middle of it. That is not
> > > allowed as this call may sleep and such behaviours aren't allowed in RCU
> > > read-side critical sections. Fix this by using GFP_NOWAIT instead.
> > 
> > This would be another way of fixing the problem that is slightly less
> > complex than my initial proposal, but does allow for using GFP_KERNEL
> > for fewer failures:
> > 
> > @@ -1366,11 +1366,10 @@ static ssize_t xps_queue_show(struct net_device 
> > *dev, unsigned int index,
> >  {
> > struct xps_dev_maps *dev_maps;
> > unsigned long *mask;
> > -   unsigned int nr_ids;
> > +   unsigned int nr_ids, new_nr_ids;
> > int j, len;
> >  
> > -   rcu_read_lock();
> > -   dev_maps = rcu_dereference(dev->xps_maps[type]);
> > +   dev_maps = READ_ONCE(dev->xps_maps[type]);
> 
> Couldn't dev_maps be freed between here and the read of dev_maps->nr_ids
> as we're not in an RCU read-side critical section?

* The first read of dev_maps->nr_ids, happening before rcu_read_lock,
  not the one shown below.

> > /* Default to nr_cpu_ids/dev->num_rx_queues and do not just return 0
> >  * when dev_maps hasn't been allocated yet, to be backward 
> > compatible.
> > @@ -1379,10 +1378,18 @@ static ssize_t xps_queue_show(struct net_device 
> > *dev, unsigned int index,
> >  (type == XPS_CPUS ? nr_cpu_ids : dev->num_rx_queues);
> >  
> > mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
> > -   if (!mask) {
> > -   rcu_read_unlock();
> > +   if (!mask)
> > return -ENOMEM;
> > -   }
> > +
> > +   rcu_read_lock();
> > +   dev_maps = rcu_dereference(dev->xps_maps[type]);
> > +   /* if nr_ids shrank in the meantime, do not overrun array.
> > +* if it increased, we just won't show the new ones
> > +*/
> > +   new_nr_ids = dev_maps ? dev_maps->nr_ids :
> > +   (type == XPS_CPUS ? nr_cpu_ids : 
> > dev->num_rx_queues);
> > +   if (new_nr_ids < nr_ids)
> > +   nr_ids = new_nr_ids;


Re: [PATCH net-next] net-sysfs: remove possible sleep from an RCU read-side critical section

2021-03-22 Thread Antoine Tenart
Quoting Matthew Wilcox (2021-03-22 17:54:39)
> On Mon, Mar 22, 2021 at 04:43:29PM +0100, Antoine Tenart wrote:
> > xps_queue_show is mostly made of an RCU read-side critical section and
> > calls bitmap_zalloc with GFP_KERNEL in the middle of it. That is not
> > allowed as this call may sleep and such behaviours aren't allowed in RCU
> > read-side critical sections. Fix this by using GFP_NOWAIT instead.
> 
> This would be another way of fixing the problem that is slightly less
> complex than my initial proposal, but does allow for using GFP_KERNEL
> for fewer failures:
> 
> @@ -1366,11 +1366,10 @@ static ssize_t xps_queue_show(struct net_device *dev, 
> unsigned int index,
>  {
> struct xps_dev_maps *dev_maps;
> unsigned long *mask;
> -   unsigned int nr_ids;
> +   unsigned int nr_ids, new_nr_ids;
> int j, len;
>  
> -   rcu_read_lock();
> -   dev_maps = rcu_dereference(dev->xps_maps[type]);
> +   dev_maps = READ_ONCE(dev->xps_maps[type]);

Couldn't dev_maps be freed between here and the read of dev_maps->nr_ids
as we're not in an RCU read-side critical section?

> /* Default to nr_cpu_ids/dev->num_rx_queues and do not just return 0
>  * when dev_maps hasn't been allocated yet, to be backward compatible.
> @@ -1379,10 +1378,18 @@ static ssize_t xps_queue_show(struct net_device *dev, 
> unsigned int index,
>  (type == XPS_CPUS ? nr_cpu_ids : dev->num_rx_queues);
>  
> mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
> -   if (!mask) {
> -   rcu_read_unlock();
> +   if (!mask)
> return -ENOMEM;
> -   }
> +
> +   rcu_read_lock();
> +   dev_maps = rcu_dereference(dev->xps_maps[type]);
> +   /* if nr_ids shrank in the meantime, do not overrun array.
> +* if it increased, we just won't show the new ones
> +*/
> +   new_nr_ids = dev_maps ? dev_maps->nr_ids :
> +   (type == XPS_CPUS ? nr_cpu_ids : dev->num_rx_queues);
> +   if (new_nr_ids < nr_ids)
> +   nr_ids = new_nr_ids;
>  
> if (!dev_maps || tc >= dev_maps->num_tc)
> goto out_no_maps;

My feeling is there is not much value in having a tricky allocation
logic for reads from xps_cpus and xps_rxqs. While we could come up with
something, returning -ENOMEM on memory pressure should be fine.

Antoine


[PATCH net-next] net-sysfs: remove possible sleep from an RCU read-side critical section

2021-03-22 Thread Antoine Tenart
xps_queue_show is mostly made of an RCU read-side critical section and
calls bitmap_zalloc with GFP_KERNEL in the middle of it. That is not
allowed, as this call may sleep and sleeping isn't allowed in RCU
read-side critical sections. Fix this by using GFP_NOWAIT instead.

Fixes: 5478fcd0f483 ("net: embed nr_ids in the xps maps")
Reported-by: kernel test robot 
Suggested-by: Matthew Wilcox 
Signed-off-by: Antoine Tenart 
---

Fix sent to net-next as it fixes an issue only in net-next.
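
For context, a minimal sketch of the constraint being fixed, simplified
from xps_queue_show (helper name is made up, map walking and output are
elided): the bitmap is allocated while an RCU read-side critical section
is held, so the allocation must not sleep.

#include <linux/bitmap.h>
#include <linux/netdevice.h>
#include <linux/rcupdate.h>

static int xps_show_sketch(struct net_device *dev, enum xps_map_type type)
{
	struct xps_dev_maps *dev_maps;
	unsigned long *mask;
	unsigned int nr_ids;

	rcu_read_lock();
	dev_maps = rcu_dereference(dev->xps_maps[type]);
	nr_ids = dev_maps ? dev_maps->nr_ids :
		 (type == XPS_CPUS ? nr_cpu_ids : dev->num_rx_queues);

	/* Sleeping is forbidden under rcu_read_lock(): GFP_KERNEL may
	 * sleep, GFP_NOWAIT keeps the allocation atomic at the cost of
	 * failing more easily under memory pressure.
	 */
	mask = bitmap_zalloc(nr_ids, GFP_NOWAIT);
	if (!mask) {
		rcu_read_unlock();
		return -ENOMEM;
	}

	/* ... walk dev_maps->attr_map[] and set bits in mask ... */

	rcu_read_unlock();
	bitmap_free(mask);
	return 0;
}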

 net/core/net-sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 562a42fcd437..f6197774048b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1378,7 +1378,7 @@ static ssize_t xps_queue_show(struct net_device *dev, 
unsigned int index,
nr_ids = dev_maps ? dev_maps->nr_ids :
 (type == XPS_CPUS ? nr_cpu_ids : dev->num_rx_queues);
 
-   mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
+   mask = bitmap_zalloc(nr_ids, GFP_NOWAIT);
if (!mask) {
rcu_read_unlock();
return -ENOMEM;
-- 
2.30.2



[PATCH net-next v4 13/13] net: NULL the old xps map entries when freeing them

2021-03-18 Thread Antoine Tenart
In __netif_set_xps_queue, old map entries from the old dev_maps are
freed but their corresponding entries in the old dev_maps aren't NULLed.
Fix this.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index d5f6ba209f1e..4961fc2e9b19 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2764,6 +2764,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
continue;
}
 
+   RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
kfree_rcu(map, rcu);
}
}
-- 
2.30.2



[PATCH net-next v4 12/13] net: fix use after free in xps

2021-03-18 Thread Antoine Tenart
When setting up a new dev_maps in __netif_set_xps_queue, we remove and
free maps from unused CPUs/rx-queues near the end of the function, by
calling remove_xps_queue. However it's possible those maps are also part
of the old, not-freed-yet dev_maps, which might be used concurrently.
When that happens, a map can be freed while its corresponding entry in
the old dev_maps table isn't NULLed, leading to: "BUG: KASAN:
use-after-free" in different places.

Fix the map freeing logic for unused CPUs/rx-queues to also NULL the map
entries in the old dev_maps table.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c8ce2dfcc97d..d5f6ba209f1e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2460,7 +2460,7 @@ static DEFINE_MUTEX(xps_map_mutex);
rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
 
 static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
-int tci, u16 index)
+struct xps_dev_maps *old_maps, int tci, u16 index)
 {
struct xps_map *map = NULL;
int pos;
@@ -2479,6 +2479,8 @@ static bool remove_xps_queue(struct xps_dev_maps 
*dev_maps,
break;
}
 
+   if (old_maps)
+   RCU_INIT_POINTER(old_maps->attr_map[tci], NULL);
RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
kfree_rcu(map, rcu);
return false;
@@ -2499,7 +2501,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
int i, j;
 
for (i = count, j = offset; i--; j++) {
-   if (!remove_xps_queue(dev_maps, tci, j))
+   if (!remove_xps_queue(dev_maps, NULL, tci, j))
break;
}
 
@@ -2631,7 +2633,7 @@ static void xps_copy_dev_maps(struct xps_dev_maps 
*dev_maps,
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, enum xps_map_type type)
 {
-   struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   struct xps_dev_maps *dev_maps, *new_dev_maps = NULL, *old_dev_maps = 
NULL;
const unsigned long *online_mask = NULL;
bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
@@ -2766,7 +2768,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
}
}
 
-   kfree_rcu(dev_maps, rcu);
+   old_dev_maps = dev_maps;
 
 out_no_old_maps:
dev_maps = new_dev_maps;
@@ -2792,10 +2794,15 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
netif_attr_test_online(j, online_mask, 
dev_maps->nr_ids))
continue;
 
-   active |= remove_xps_queue(dev_maps, tci, index);
+   active |= remove_xps_queue(dev_maps,
+  copy ? old_dev_maps : NULL,
+  tci, index);
}
}
 
+   if (old_dev_maps)
+   kfree_rcu(old_dev_maps, rcu);
+
/* free map if not active */
if (!active)
reset_xps_maps(dev, dev_maps, type);
-- 
2.30.2



[PATCH net-next v4 10/13] net-sysfs: move the rtnl unlock up in the xps show helpers

2021-03-18 Thread Antoine Tenart
Now that nr_ids and num_tc are stored in the xps dev_maps, which are RCU
protected, we no longer need to protect the maps under the rtnl lock.
Move the rtnl unlock up so we reduce the rtnl locking section.

We also increase the reference count on the subordinate device, if any,
as we don't want this device to be freed while we use it (now that the
rtnl lock isn't protecting it for the whole function).

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 25 +++--
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index ca1f3b63cfad..094fea082649 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1383,10 +1383,14 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
tc = netdev_txq_to_tc(dev, index);
if (tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
+   rtnl_unlock();
+   return -EINVAL;
}
 
+   /* Make sure the subordinate device can't be freed */
+   get_device(&dev->dev);
+   rtnl_unlock();
+
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps[XPS_CPUS]);
nr_ids = dev_maps ? dev_maps->nr_ids : nr_cpu_ids;
@@ -1417,8 +1421,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
 out_no_maps:
rcu_read_unlock();
-
-   rtnl_unlock();
+   put_device(&dev->dev);
 
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
@@ -1426,8 +1429,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
 err_rcu_unlock:
rcu_read_unlock();
-err_rtnl_unlock:
-   rtnl_unlock();
+   put_device(&dev->dev);
return ret;
 }
 
@@ -1486,10 +1488,9 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
return restart_syscall();
 
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
-   }
+   rtnl_unlock();
+   if (tc < 0)
+   return -EINVAL;
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps[XPS_RXQS]);
@@ -1522,8 +1523,6 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
 out_no_maps:
rcu_read_unlock();
 
-   rtnl_unlock();
-
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
 
@@ -1531,8 +1530,6 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
 
 err_rcu_unlock:
rcu_read_unlock();
-err_rtnl_unlock:
-   rtnl_unlock();
return ret;
 }
 
-- 
2.30.2



[PATCH net-next v4 11/13] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-03-18 Thread Antoine Tenart
Most of the xps_cpus_show and xps_rxqs_show functions share the same
logic. Having it in two different functions does not help maintenance.
This patch moves their common logic into a new function, xps_queue_show,
to improve this.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 125 +--
 1 file changed, 48 insertions(+), 77 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 094fea082649..562a42fcd437 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1361,44 +1361,27 @@ static const struct attribute_group dql_group = {
 #endif /* CONFIG_BQL */
 
 #ifdef CONFIG_XPS
-static ssize_t xps_cpus_show(struct netdev_queue *queue,
-char *buf)
+static ssize_t xps_queue_show(struct net_device *dev, unsigned int index,
+ int tc, char *buf, enum xps_map_type type)
 {
-   struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned int index, nr_ids;
-   int j, len, ret, tc = 0;
unsigned long *mask;
-
-   if (!netif_is_multiqueue(dev))
-   return -ENOENT;
-
-   index = get_netdev_queue_index(queue);
-
-   if (!rtnl_trylock())
-   return restart_syscall();
-
-   /* If queue belongs to subordinate dev use its map */
-   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
-
-   tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0) {
-   rtnl_unlock();
-   return -EINVAL;
-   }
-
-   /* Make sure the subordinate device can't be freed */
-   get_device(&dev->dev);
-   rtnl_unlock();
+   unsigned int nr_ids;
+   int j, len;
 
rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_maps[XPS_CPUS]);
-   nr_ids = dev_maps ? dev_maps->nr_ids : nr_cpu_ids;
+   dev_maps = rcu_dereference(dev->xps_maps[type]);
+
+   /* Default to nr_cpu_ids/dev->num_rx_queues and do not just return 0
+* when dev_maps hasn't been allocated yet, to be backward compatible.
+*/
+   nr_ids = dev_maps ? dev_maps->nr_ids :
+(type == XPS_CPUS ? nr_cpu_ids : dev->num_rx_queues);
 
mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
if (!mask) {
-   ret = -ENOMEM;
-   goto err_rcu_unlock;
+   rcu_read_unlock();
+   return -ENOMEM;
}
 
if (!dev_maps || tc >= dev_maps->num_tc)
@@ -1421,16 +1404,44 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
 out_no_maps:
rcu_read_unlock();
-   put_device(&dev->dev);
 
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
+
return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
+{
+   struct net_device *dev = queue->dev;
+   unsigned int index;
+   int len, tc;
+
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+
+   if (!rtnl_trylock())
+   return restart_syscall();
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
+   tc = netdev_txq_to_tc(dev, index);
+   if (tc < 0) {
+   rtnl_unlock();
+   return -EINVAL;
+   }
+
+   /* Make sure the subordinate device can't be freed */
+   get_device(&dev->dev);
+   rtnl_unlock();
+
+   len = xps_queue_show(dev, index, tc, buf, XPS_CPUS);
 
-err_rcu_unlock:
-   rcu_read_unlock();
put_device(&dev->dev);
-   return ret;
+   return len;
 }
 
 static ssize_t xps_cpus_store(struct netdev_queue *queue,
@@ -1477,10 +1488,8 @@ static struct netdev_queue_attribute xps_cpus_attribute 
__ro_after_init
 static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
 {
struct net_device *dev = queue->dev;
-   struct xps_dev_maps *dev_maps;
-   unsigned int index, nr_ids;
-   int j, len, ret, tc = 0;
-   unsigned long *mask;
+   unsigned int index;
+   int tc;
 
index = get_netdev_queue_index(queue);
 
@@ -1492,45 +1501,7 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
if (tc < 0)
return -EINVAL;
 
-   rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_maps[XPS_RXQS]);
-   nr_ids = dev_maps ? dev_maps->nr_ids : dev->num_rx_queues;
-
-   mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
-   if (!mask) {
-   ret = -ENOMEM;
-   goto err_rcu_unlock;
-   }
-
-   if (!dev_maps || tc >= dev_maps->num_tc)
-   goto out_no_maps;
-
-   for (j = 0; j < nr_ids; j++) {
-   int i, tci = j * dev_maps->num_tc + tc;
-   

[PATCH net-next v4 05/13] net: embed nr_ids in the xps maps

2021-03-18 Thread Antoine Tenart
Embed nr_ids (the number of cpus for the xps cpus map, and the number of
rxqs for the xps rxqs map) in dev_maps. That will help avoid accessing
out-of-bound memory if those values change after dev_maps was allocated.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  4 
 net/core/dev.c| 45 ++-
 net/core/net-sysfs.c  | 38 +++--
 3 files changed, 47 insertions(+), 40 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c38534c55ea1..09e73f5a8c78 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -772,6 +772,9 @@ struct xps_map {
 /*
  * This structure holds all XPS maps for device.  Maps are indexed by CPU.
  *
+ * We keep track of the number of cpus/rxqs used when the struct is allocated,
+ * in nr_ids. This will help not accessing out-of-bound memory.
+ *
  * We keep track of the number of traffic classes used when the struct is
  * allocated, in num_tc. This will be used to navigate the maps, to ensure 
we're
  * not crossing its upper bound, as the original dev->num_tc can be updated in
@@ -779,6 +782,7 @@ struct xps_map {
  */
 struct xps_dev_maps {
struct rcu_head rcu;
+   unsigned int nr_ids;
s16 num_tc;
struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 4e29d1994fdd..7530c95970a0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2524,14 +2524,14 @@ static void reset_xps_maps(struct net_device *dev,
 }
 
 static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
-  struct xps_dev_maps *dev_maps, unsigned int nr_ids,
-  u16 offset, u16 count, bool is_rxqs_map)
+  struct xps_dev_maps *dev_maps, u16 offset, u16 count,
+  bool is_rxqs_map)
 {
+   unsigned int nr_ids = dev_maps->nr_ids;
bool active = false;
int i, j;
 
-   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids),
-j < nr_ids;)
+   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
   count);
if (!active)
@@ -2551,7 +2551,6 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 {
const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
-   unsigned int nr_ids;
 
if (!static_key_false(&xps_needed))
return;
@@ -2561,11 +2560,9 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
-   if (dev_maps) {
-   nr_ids = dev->num_rx_queues;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
-  offset, count, true);
-   }
+   if (dev_maps)
+   clean_xps_maps(dev, possible_mask, dev_maps, offset,
+  count, true);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2574,9 +2571,7 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
if (num_possible_cpus() > 1)
possible_mask = cpumask_bits(cpu_possible_mask);
-   nr_ids = nr_cpu_ids;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
-  false);
+   clean_xps_maps(dev, possible_mask, dev_maps, offset, count, false);
 
 out_no_maps:
mutex_unlock(&xps_map_mutex);
@@ -2673,11 +2668,12 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
maps_sz = L1_CACHE_BYTES;
 
/* The old dev_maps could be larger or smaller than the one we're
-* setting up now, as dev->num_tc could have been updated in between. We
-* could try to be smart, but let's be safe instead and only copy
-* foreign traffic classes if the two map sizes match.
+* setting up now, as dev->num_tc or nr_ids could have been updated in
+* between. We could try to be smart, but let's be safe instead and only
+* copy foreign traffic classes if the two map sizes match.
 */
-   if (dev_maps && dev_maps->num_tc == num_tc)
+   if (dev_maps &&
+   dev_maps->num_tc == num_tc && dev_maps->nr_ids == nr_ids)
copy = true;
 
/* allocate memory for queue storage */
@@ -2690,6 +2686,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
return -ENOMEM;
}
 
+   

[PATCH net-next v4 04/13] net: embed num_tc in the xps maps

2021-03-18 Thread Antoine Tenart
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to a
mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out-of-bound accesses can occur
(as can easily be seen using KASAN).

This patch aims at fixing this by embedding num_tc into the maps, using
the value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
  sure we won't have out of bound accesses.
- Checks can be made before accessing the maps so we know the values
  retrieved will make sense.

We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps into the new one, only if the number of traffic classes of both
maps match.

Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  6 
 net/core/dev.c| 63 +--
 net/core/net-sysfs.c  | 45 +++-
 3 files changed, 64 insertions(+), 50 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 97254c089eb2..c38534c55ea1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -771,9 +771,15 @@ struct xps_map {
 
 /*
  * This structure holds all XPS maps for device.  Maps are indexed by CPU.
+ *
+ * We keep track of the number of traffic classes used when the struct is
+ * allocated, in num_tc. This will be used to navigate the maps, to ensure 
we're
+ * not crossing its upper bound, as the original dev->num_tc can be updated in
+ * the meantime.
  */
 struct xps_dev_maps {
struct rcu_head rcu;
+   s16 num_tc;
struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 6bc20eabd2b0..4e29d1994fdd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2491,7 +2491,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 struct xps_dev_maps *dev_maps,
 int cpu, u16 offset, u16 count)
 {
-   int num_tc = dev->num_tc ? : 1;
+   int num_tc = dev_maps->num_tc;
bool active = false;
int tci;
 
@@ -2634,10 +2634,10 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 {
const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
struct xps_map *map, *new_map;
-   bool active = false;
unsigned int nr_ids;
 
if (dev->num_tc) {
@@ -2672,19 +2672,29 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (maps_sz < L1_CACHE_BYTES)
maps_sz = L1_CACHE_BYTES;
 
+   /* The old dev_maps could be larger or smaller than the one we're
+* setting up now, as dev->num_tc could have been updated in between. We
+* could try to be smart, but let's be safe instead and only copy
+* foreign traffic classes if the two map sizes match.
+*/
+   if (dev_maps && dev_maps->num_tc == num_tc)
+   copy = true;
+
/* allocate memory for queue storage */
for (j = -1; j = netif_attrmask_next_and(j, online_mask, mask, nr_ids),
 j < nr_ids;) {
-   if (!new_dev_maps)
-   new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
if (!new_dev_maps) {
-   mutex_unlock(&xps_map_mutex);
-   return -ENOMEM;
+   new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
+   if (!new_dev_maps) {
+   mutex_unlock(&xps_map_mutex);
+   return -ENOMEM;
+   }
+
+   new_dev_maps->num_tc = num_tc;
}
 
tci = j * num_tc + tc;
-   map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
-NULL;
+   map = copy ? xmap_dereference(dev_maps->attr_map[tci]) : NULL;
 
map = expand_xps_map(map, j, index, is_rxqs_map);
if (!map)
@@ -2706,7 +2716,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
/* copy maps belonging to foreign traffic classes */
-   for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
+   for (i = tc, tci = j * num_tc; copy && i--; tci++) {
/* fill in the new device map from the old device map */
map

[PATCH net-next v4 02/13] net-sysfs: store the return of get_netdev_queue_index in an unsigned int

2021-03-18 Thread Antoine Tenart
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of
its callers use an unsigned long to store the returned value. Update the
code to be consistent; this should only be cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 3a083c0c9dd3..5dc4223f6b68 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1367,7 +1367,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1437,7 +1438,7 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
  const char *buf, size_t len)
 {
struct net_device *dev = queue->dev;
-   unsigned long index;
+   unsigned int index;
cpumask_var_t mask;
int err;
 
@@ -1479,7 +1480,8 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
index = get_netdev_queue_index(queue);
 
@@ -1541,7 +1543,8 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, 
const char *buf,
 {
struct net_device *dev = queue->dev;
struct net *net = dev_net(dev);
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
int err;
 
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
-- 
2.30.2



[PATCH net-next v4 00/13] net: xps: improve the xps maps handling

2021-03-18 Thread Antoine Tenart
Hello,

This series aims at fixing various issues with the xps code, including
out-of-bound accesses and use-after-free. While doing so we try to
improve the xps code maintainability and readability.

The main change is moving dev->num_tc and dev->nr_ids into the xps maps,
to avoid out-of-bound accesses as those two fields can be updated after
the maps have been allocated. This allows further reworks, to improve the
xps code readability and to stop taking the rtnl lock when reading the
maps in sysfs. The maps are moved to an array in net_device, which
simplifies the code a lot.

One future improvement may be to remove the use of xps_map_mutex from
net/core/dev.c, but that may require extra care.

Thanks!
Antoine

Since v3:
  - Removed the 3 patches about the rtnl lock and __netif_set_xps_queue
as there are extra issues. Those patches were not tied to the
others, and I'll see what can be done as a separate effort.
  - One small fix in patch 12.

Since v2:
  - Patches 13-16 are new to the series.
  - Fixed another issue I found while preparing v3 (use after free of
old xps maps).
  - Kept the rtnl lock when calling netdev_get_tx_queue and
netdev_txq_to_tc.
  - Use get_device/put_device when using the sb_dev.
  - Take the rtnl lock in mlx5 and virtio_net when calling
netif_set_xps_queue.
  - Fixed a coding style issue.

Since v1:
  - Reordered the patches to improve readability and avoid introducing
issues in between patches.
  - Use dev_maps->nr_ids to allocate the mask in xps_queue_show but
still default to nr_cpu_ids/dev->num_rx_queues in xps_queue_show
when dev_maps hasn't been allocated yet for backward
    compatibility.


Antoine Tenart (13):
  net-sysfs: convert xps_cpus_show to bitmap_zalloc
  net-sysfs: store the return of get_netdev_queue_index in an unsigned
int
  net-sysfs: make xps_cpus_show and xps_rxqs_show consistent
  net: embed num_tc in the xps maps
  net: embed nr_ids in the xps maps
  net: remove the xps possible_mask
  net: move the xps maps to an array
  net: add an helper to copy xps maps to the new dev_maps
  net: improve queue removal readability in __netif_set_xps_queue
  net-sysfs: move the rtnl unlock up in the xps show helpers
  net-sysfs: move the xps cpus/rxqs retrieval in a common function
  net: fix use after free in xps
  net: NULL the old xps map entries when freeing them

 drivers/net/virtio_net.c  |   2 +-
 include/linux/netdevice.h |  27 -
 net/core/dev.c| 247 --
 net/core/net-sysfs.c  | 177 +++
 4 files changed, 222 insertions(+), 231 deletions(-)

-- 
2.30.2



[PATCH net-next v4 01/13] net-sysfs: convert xps_cpus_show to bitmap_zalloc

2021-03-18 Thread Antoine Tenart
Use bitmap_zalloc instead of zalloc_cpumask_var in xps_cpus_show to
align with xps_rxqs_show. This will improve maintenance and allow us to
factorize the two functions. The function should behave the same.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 307628fdf380..3a083c0c9dd3 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1367,8 +1367,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   cpumask_var_t mask;
-   unsigned long index;
+   unsigned long *mask, index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1396,7 +1395,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
}
 
-   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
+   if (!mask) {
ret = -ENOMEM;
goto err_rtnl_unlock;
}
@@ -1414,7 +1414,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
for (i = map->len; i--;) {
if (map->queues[i] == index) {
-   cpumask_set_cpu(cpu, mask);
+   set_bit(cpu, mask);
break;
}
}
@@ -1424,8 +1424,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
rtnl_unlock();
 
-   len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask));
-   free_cpumask_var(mask);
+   len = bitmap_print_to_pagebuf(false, buf, mask, nr_cpu_ids);
+   bitmap_free(mask);
return len < PAGE_SIZE ? len : -EINVAL;
 
 err_rtnl_unlock:
-- 
2.30.2



[PATCH net-next v4 07/13] net: move the xps maps to an array

2021-03-18 Thread Antoine Tenart
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify the code a lot, removing the need for lots
of if/else conditionals, as the correct map will be available using its
offset in the array.

This should not modify the xps maps behaviour in any way.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 drivers/net/virtio_net.c  |  2 +-
 include/linux/netdevice.h | 17 +
 net/core/dev.c| 73 +--
 net/core/net-sysfs.c  |  6 ++--
 4 files changed, 46 insertions(+), 52 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 77ba8e2fc11c..584a9bd59dda 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2015,7 +2015,7 @@ static void virtnet_set_affinity(struct virtnet_info *vi)
}
virtqueue_set_affinity(vi->rq[i].vq, mask);
virtqueue_set_affinity(vi->sq[i].vq, mask);
-   __netif_set_xps_queue(vi->dev, cpumask_bits(mask), i, false);
+   __netif_set_xps_queue(vi->dev, cpumask_bits(mask), i, XPS_CPUS);
cpumask_clear(mask);
}
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 09e73f5a8c78..494050be 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -754,6 +754,13 @@ struct rx_queue_attribute {
 const char *buf, size_t len);
 };
 
+/* XPS map type and offset of the xps map within net_device->xps_maps[]. */
+enum xps_map_type {
+   XPS_CPUS = 0,
+   XPS_RXQS,
+   XPS_MAPS_MAX,
+};
+
 #ifdef CONFIG_XPS
 /*
  * This structure holds an XPS map which can be of variable length.  The
@@ -1773,8 +1780,7 @@ enum netdev_ml_priv_type {
  * @tx_queue_len:  Max frames per queue allowed
  * @tx_global_lock:XXX: need comments on this one
  * @xdp_bulkq: XDP device bulk queue
- * @xps_cpus_map:  all CPUs map for XPS device
- * @xps_rxqs_map:  all RXQs map for XPS device
+ * @xps_maps:  all CPUs/RXQs maps for XPS device
  *
  * @xps_maps:  XXX: need comments on this one
  * @miniq_egress:  clsact qdisc specific data for
@@ -2070,8 +2076,7 @@ struct net_device {
struct xdp_dev_bulk_queue __percpu *xdp_bulkq;
 
 #ifdef CONFIG_XPS
-   struct xps_dev_maps __rcu *xps_cpus_map;
-   struct xps_dev_maps __rcu *xps_rxqs_map;
+   struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 #endif
 #ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
@@ -3701,7 +3706,7 @@ static inline void netif_wake_subqueue(struct net_device 
*dev, u16 queue_index)
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
- u16 index, bool is_rxqs_map);
+ u16 index, enum xps_map_type type);
 
 /**
  * netif_attr_test_mask - Test a CPU or Rx queue set in a mask
@@ -3796,7 +3801,7 @@ static inline int netif_set_xps_queue(struct net_device 
*dev,
 
 static inline int __netif_set_xps_queue(struct net_device *dev,
const unsigned long *mask,
-   u16 index, bool is_rxqs_map)
+   u16 index, enum xps_map_type type)
 {
return 0;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 3ed8cb3a4061..af57e32bb543 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2511,31 +2511,34 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 
 static void reset_xps_maps(struct net_device *dev,
   struct xps_dev_maps *dev_maps,
-  bool is_rxqs_map)
+  enum xps_map_type type)
 {
-   if (is_rxqs_map) {
-   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
-   RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
-   } else {
-   RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
-   }
static_key_slow_dec_cpuslocked(&xps_needed);
+   if (type == XPS_RXQS)
+   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
+
+   RCU_INIT_POINTER(dev->xps_maps[type], NULL);
+
kfree_rcu(dev_maps, rcu);
 }
 
-static void clean_xps_maps(struct net_device *dev,
-  struct xps_dev_maps *dev_maps, u16 offset, u16 count,
-  bool is_rxqs_map)
+static void clean_xps_maps(struct net_device *dev, enum xps_map_type type,
+  u16 offset, u16 count)
 {
+   struct xps_dev_maps *dev_maps;
bool active = false;
int i, j;
 
+   dev_maps = xmap_dereference(dev->xps_maps[type]);
+   if (!dev_maps)
+   return;
+
for (j = 0; j < dev_maps->nr_ids; j++)
 

[PATCH net-next v4 06/13] net: remove the xps possible_mask

2021-03-18 Thread Antoine Tenart
Remove the xps possible_mask. It was an optimization but we can just
loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That
simplifies the code a bit.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 net/core/dev.c   | 40 +---
 net/core/net-sysfs.c |  4 ++--
 2 files changed, 15 insertions(+), 29 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7530c95970a0..3ed8cb3a4061 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2523,33 +2523,28 @@ static void reset_xps_maps(struct net_device *dev,
kfree_rcu(dev_maps, rcu);
 }
 
-static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
+static void clean_xps_maps(struct net_device *dev,
   struct xps_dev_maps *dev_maps, u16 offset, u16 count,
   bool is_rxqs_map)
 {
-   unsigned int nr_ids = dev_maps->nr_ids;
bool active = false;
int i, j;
 
-   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
-   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
-  count);
+   for (j = 0; j < dev_maps->nr_ids; j++)
+   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset, count);
if (!active)
reset_xps_maps(dev, dev_maps, is_rxqs_map);
 
if (!is_rxqs_map) {
-   for (i = offset + (count - 1); count--; i--) {
+   for (i = offset + (count - 1); count--; i--)
netdev_queue_numa_node_write(
-   netdev_get_tx_queue(dev, i),
-   NUMA_NO_NODE);
-   }
+   netdev_get_tx_queue(dev, i), NUMA_NO_NODE);
}
 }
 
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
   u16 count)
 {
-   const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
 
if (!static_key_false(&xps_needed))
@@ -2561,17 +2556,14 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
if (dev_maps)
-   clean_xps_maps(dev, possible_mask, dev_maps, offset,
-  count, true);
+   clean_xps_maps(dev, dev_maps, offset, count, true);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
if (!dev_maps)
goto out_no_maps;
 
-   if (num_possible_cpus() > 1)
-   possible_mask = cpumask_bits(cpu_possible_mask);
-   clean_xps_maps(dev, possible_mask, dev_maps, offset, count, false);
+   clean_xps_maps(dev, dev_maps, offset, count, false);
 
 out_no_maps:
mutex_unlock(&xps_map_mutex);
@@ -2627,8 +2619,8 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
 {
-   const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   const unsigned long *online_mask = NULL;
bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
@@ -2656,10 +2648,8 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
nr_ids = dev->num_rx_queues;
} else {
maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
-   if (num_possible_cpus() > 1) {
+   if (num_possible_cpus() > 1)
online_mask = cpumask_bits(cpu_online_mask);
-   possible_mask = cpumask_bits(cpu_possible_mask);
-   }
dev_maps = xmap_dereference(dev->xps_cpus_map);
nr_ids = nr_cpu_ids;
}
@@ -2710,8 +2700,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
static_key_slow_inc_cpuslocked(&xps_rxqs_needed);
}
 
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < nr_ids;) {
+   for (j = 0; j < nr_ids; j++) {
/* copy maps belonging to foreign traffic classes */
for (i = tc, tci = j * num_tc; copy && i--; tci++) {
/* fill in the new device map from the old device map */
@@ -2766,8 +2755,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (!dev_maps)
goto out_no_old_maps;
 
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < dev_maps->nr_ids;) {
+   for (j = 0; j < dev_maps->

[PATCH net-next v4 03/13] net-sysfs: make xps_cpus_show and xps_rxqs_show consistent

2021-03-18 Thread Antoine Tenart
Make the implementations of xps_cpus_show and xps_rxqs_show converge,
as the two share the same logic but diverged over time. This should not
modify their behaviour but will help future changes and improve
maintenance.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5dc4223f6b68..5f76183ad5bc 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1364,7 +1364,7 @@ static const struct attribute_group dql_group = {
 static ssize_t xps_cpus_show(struct netdev_queue *queue,
 char *buf)
 {
-   int cpu, len, ret, num_tc = 1, tc = 0;
+   int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
unsigned long *mask;
@@ -1404,23 +1404,26 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_cpus_map);
-   if (dev_maps) {
-   for_each_possible_cpu(cpu) {
-   int i, tci = cpu * num_tc + tc;
-   struct xps_map *map;
-
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (!map)
-   continue;
-
-   for (i = map->len; i--;) {
-   if (map->queues[i] == index) {
-   set_bit(cpu, mask);
-   break;
-   }
+   if (!dev_maps)
+   goto out_no_maps;
+
+   for (j = -1; j = netif_attrmask_next(j, NULL, nr_cpu_ids),
+j < nr_cpu_ids;) {
+   int i, tci = j * num_tc + tc;
+   struct xps_map *map;
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (!map)
+   continue;
+
+   for (i = map->len; i--;) {
+   if (map->queues[i] == index) {
+   set_bit(j, mask);
+   break;
}
}
}
+out_no_maps:
rcu_read_unlock();
 
rtnl_unlock();
-- 
2.30.2



[PATCH net-next v4 08/13] net: add an helper to copy xps maps to the new dev_maps

2021-03-18 Thread Antoine Tenart
This patch adds a helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with
improved code readability and maintenance.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index af57e32bb543..00f6b41e11d8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2608,6 +2608,25 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
return new_map;
 }
 
+/* Copy xps maps at a given index */
+static void xps_copy_dev_maps(struct xps_dev_maps *dev_maps,
+ struct xps_dev_maps *new_dev_maps, int index,
+ int tc, bool skip_tc)
+{
+   int i, tci = index * dev_maps->num_tc;
+   struct xps_map *map;
+
+   /* copy maps belonging to foreign traffic classes */
+   for (i = 0; i < dev_maps->num_tc; i++, tci++) {
+   if (i == tc && skip_tc)
+   continue;
+
+   /* fill in the new device map from the old device map */
+   map = xmap_dereference(dev_maps->attr_map[tci]);
+   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
+   }
+}
+
 /* Must be called under cpus_read_lock */
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, enum xps_map_type type)
@@ -2694,23 +2713,16 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
}
 
for (j = 0; j < nr_ids; j++) {
-   /* copy maps belonging to foreign traffic classes */
-   for (i = tc, tci = j * num_tc; copy && i--; tci++) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
-   }
+   bool skip_tc = false;
 
-   /* We need to explicitly update tci as prevous loop
-* could break out early if dev_maps is NULL.
-*/
tci = j * num_tc + tc;
-
if (netif_attr_test_mask(j, mask, nr_ids) &&
netif_attr_test_online(j, online_mask, nr_ids)) {
/* add tx-queue to CPU/rx-queue maps */
int pos = 0;
 
+   skip_tc = true;
+
map = xmap_dereference(new_dev_maps->attr_map[tci]);
while ((pos < map->len) && (map->queues[pos] != index))
pos++;
@@ -2725,18 +2737,11 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
numa_node_id = -1;
}
 #endif
-   } else if (copy) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
 
-   /* copy maps belonging to foreign traffic classes */
-   for (i = num_tc - tc, tci++; copy && --i; tci++) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
-   }
+   if (copy)
+   xps_copy_dev_maps(dev_maps, new_dev_maps, j, tc,
+ skip_tc);
}
 
rcu_assign_pointer(dev->xps_maps[type], new_dev_maps);
-- 
2.30.2



[PATCH net-next v4 09/13] net: improve queue removal readability in __netif_set_xps_queue

2021-03-18 Thread Antoine Tenart
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 00f6b41e11d8..c8ce2dfcc97d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2784,13 +2784,16 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 
/* removes tx-queue from unused CPUs/rx-queues */
for (j = 0; j < dev_maps->nr_ids; j++) {
-   for (i = tc, tci = j * dev_maps->num_tc; i--; tci++)
-   active |= remove_xps_queue(dev_maps, tci, index);
-   if (!netif_attr_test_mask(j, mask, dev_maps->nr_ids) ||
-   !netif_attr_test_online(j, online_mask, dev_maps->nr_ids))
-   active |= remove_xps_queue(dev_maps, tci, index);
-   for (i = dev_maps->num_tc - tc, tci++; --i; tci++)
+   tci = j * dev_maps->num_tc;
+
+   for (i = 0; i < dev_maps->num_tc; i++, tci++) {
+   if (i == tc &&
+   netif_attr_test_mask(j, mask, dev_maps->nr_ids) &&
+   netif_attr_test_online(j, online_mask, 
dev_maps->nr_ids))
+   continue;
+
active |= remove_xps_queue(dev_maps, tci, index);
+   }
}
 
/* free map if not active */
-- 
2.30.2



Re: [PATCH net-next v3 15/16] net/mlx5e: take the rtnl lock when calling netif_set_xps_queue

2021-03-17 Thread Antoine Tenart
Quoting Saeed Mahameed (2021-03-12 21:54:18)
> On Fri, 2021-03-12 at 16:04 +0100, Antoine Tenart wrote:
> > netif_set_xps_queue must be called with the rtnl lock taken, and this
> > is
> > now enforced using ASSERT_RTNL(). mlx5e_attach_netdev was taking the
> > lock conditionally, fix this by taking the rtnl lock all the time.
> 
> There is a reason why it is conditional:
> we had a bug in the past of double locking here:
> 
> [ 4255.283960] echo/644 is trying to acquire lock:
> 
>  [ 4255.285092] 85101f90 (rtnl_mutex){+..}, at:
> mlx5e_attach_netdev0xd4/0×3d0 [mlx5_core]
> 
>  [ 4255.287264] 
> 
>  [ 4255.287264] but task is already holding lock:
> 
>  [ 4255.288971] 85101f90 (rtnl_mutex){+..}, at:
> ipoib_vlan_add0×7c/0×2d0 [ib_ipoib]
> 
> ipoib_vlan_add is called under rtnl and will eventually call 
> mlx5e_attach_netdev, we don't have much control over this in mlx5
> driver since the rdma stack provides a per-prepared netdev to attach to
> our hw. maybe it is time we had a nested rtnl lock .. 

Not sure we want to add a nested rtnl lock because of xps. I'd like to
see other options first, such as a locking mechanism for xps that
doesn't rely on rtnl, if that's possible.

As for this series, patches 6, 15 (this one) and 16 are not linked to
and do not rely on the other patches. They're improvements or fixes for
already existing behaviours. The series already gained enough new
patches since v1 and I don't want to maintain it out-of-tree for too
long, so I'll resend it without patches 6, 15 and 16; then we'll be
able to focus on the xps locking relationship with rtnl.

Antoine


Re: [PATCH net-next v3 15/16] net/mlx5e: take the rtnl lock when calling netif_set_xps_queue

2021-03-15 Thread Antoine Tenart
Quoting Maxim Mikityanskiy (2021-03-15 15:53:02)
> On 2021-03-15 10:38, Antoine Tenart wrote:
> > Quoting Saeed Mahameed (2021-03-12 21:54:18)
> >> There is a reason why it is conditional:
> >> we had a bug in the past of double locking here:
> >>
> >> [ 4255.283960] echo/644 is trying to acquire lock:
> >>
> >>   [ 4255.285092] 85101f90 (rtnl_mutex){+..}, at:
> >> mlx5e_attach_netdev0xd4/0×3d0 [mlx5_core]
> >>
> >>   [ 4255.287264]
> >>
> >>   [ 4255.287264] but task is already holding lock:
> >>
> >>   [ 4255.288971] 85101f90 (rtnl_mutex){+..}, at:
> >> ipoib_vlan_add0×7c/0×2d0 [ib_ipoib]
> >>
> >> ipoib_vlan_add is called under rtnl and will eventually call
> >> mlx5e_attach_netdev, we don't have much control over this in mlx5
> >> driver since the rdma stack provides a per-prepared netdev to attach to
> >> our hw. maybe it is time we had a nested rtnl lock ..
> > 
> > Thanks for the explanation. So as you said, we can't based the locking
> > decision only on the driver own state / information...
> > 
> > What about `take_rtnl = !rtnl_is_locked();`?
> 
> It won't work, because the lock may be taken by some other unrelated 
> thread. By doing `if (!rtnl_is_locked()) rtnl_lock()` we defeat the 
> purpose of the lock, because we will proceed to the critical section 
> even if we should wait until some other thread releases the lock.

Ah, that's right...
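
For the record, a hypothetical sketch (not proposed code, names made up)
of why the check-then-lock idea can't work: rtnl_is_locked() only says
that someone holds the lock, not that the current thread does.

#include <linux/rtnetlink.h>
#include <linux/types.h>

static void conditional_rtnl_sketch(void)
{
	bool take_rtnl = !rtnl_is_locked();

	/* If another, unrelated thread holds the rtnl lock at this point,
	 * take_rtnl is false and we enter the critical section without
	 * waiting, racing with that thread instead of serializing with it.
	 */
	if (take_rtnl)
		rtnl_lock();

	/* ... critical section that assumes rtnl is held by this thread ... */

	if (take_rtnl)
		rtnl_unlock();
}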


Re: [PATCH net-next v3 15/16] net/mlx5e: take the rtnl lock when calling netif_set_xps_queue

2021-03-15 Thread Antoine Tenart
Quoting Saeed Mahameed (2021-03-12 21:54:18)
> On Fri, 2021-03-12 at 16:04 +0100, Antoine Tenart wrote:
> > netif_set_xps_queue must be called with the rtnl lock taken, and this
> > is
> > now enforced using ASSERT_RTNL(). mlx5e_attach_netdev was taking the
> > lock conditionally, fix this by taking the rtnl lock all the time.
> > 
> > Signed-off-by: Antoine Tenart 
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 11 +++
> >  1 file changed, 3 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index ec2fcb2a2977..96cba86b9f0d 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -5557,7 +5557,6 @@ static void mlx5e_update_features(struct
> > net_device *netdev)
> >  
> >  int mlx5e_attach_netdev(struct mlx5e_priv *priv)
> >  {
> > -   const bool take_rtnl = priv->netdev->reg_state ==
> > NETREG_REGISTERED;
> > const struct mlx5e_profile *profile = priv->profile;
> > int max_nch;
> > int err;
> > @@ -5578,15 +5577,11 @@ int mlx5e_attach_netdev(struct mlx5e_priv
> > *priv)
> >  * 2. Set our default XPS cpumask.
> >  * 3. Build the RQT.
> >  *
> > -    * rtnl_lock is required by netif_set_real_num_*_queues in case
> > the
> > -    * netdev has been registered by this point (if this function
> > was called
> > -    * in the reload or resume flow).
> > +    * rtnl_lock is required by netif_set_xps_queue.
> >  */
> 
> There is a reason why it is conditional:
> we had a bug in the past of double locking here:
> 
> [ 4255.283960] echo/644 is trying to acquire lock:
> 
>  [ 4255.285092] 85101f90 (rtnl_mutex){+..}, at:
> mlx5e_attach_netdev0xd4/0×3d0 [mlx5_core]
> 
>  [ 4255.287264] 
> 
>  [ 4255.287264] but task is already holding lock:
> 
>  [ 4255.288971] 85101f90 (rtnl_mutex){+..}, at:
> ipoib_vlan_add+0x7c/0x2d0 [ib_ipoib]
> 
> ipoib_vlan_add is called under rtnl and will eventually call 
> mlx5e_attach_netdev, we don't have much control over this in mlx5
> driver since the rdma stack provides a pre-prepared netdev to attach to
> our hw. maybe it is time we had a nested rtnl lock .. 

Thanks for the explanation. So as you said, we can't base the locking
decision only on the driver's own state / information...

What about `take_rtnl = !rtnl_is_locked();`?

Thanks!
Antoine


[PATCH net-next v3 15/16] net/mlx5e: take the rtnl lock when calling netif_set_xps_queue

2021-03-12 Thread Antoine Tenart
netif_set_xps_queue must be called with the rtnl lock taken, and this is
now enforced using ASSERT_RTNL(). mlx5e_attach_netdev was taking the
lock conditionally; fix this by taking the rtnl lock all the time.

Signed-off-by: Antoine Tenart 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ec2fcb2a2977..96cba86b9f0d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5557,7 +5557,6 @@ static void mlx5e_update_features(struct net_device 
*netdev)
 
 int mlx5e_attach_netdev(struct mlx5e_priv *priv)
 {
-   const bool take_rtnl = priv->netdev->reg_state == NETREG_REGISTERED;
const struct mlx5e_profile *profile = priv->profile;
int max_nch;
int err;
@@ -5578,15 +5577,11 @@ int mlx5e_attach_netdev(struct mlx5e_priv *priv)
 * 2. Set our default XPS cpumask.
 * 3. Build the RQT.
 *
-* rtnl_lock is required by netif_set_real_num_*_queues in case the
-* netdev has been registered by this point (if this function was called
-* in the reload or resume flow).
+* rtnl_lock is required by netif_set_xps_queue.
 */
-   if (take_rtnl)
-   rtnl_lock();
+   rtnl_lock();
err = mlx5e_num_channels_changed(priv);
-   if (take_rtnl)
-   rtnl_unlock();
+   rtnl_unlock();
if (err)
goto out;
 
-- 
2.29.2



[PATCH net-next v3 16/16] virtio_net: take the rtnl lock when calling virtnet_set_affinity

2021-03-12 Thread Antoine Tenart
netif_set_xps_queue must be called with the rtnl lock taken, and this is
now enforced using ASSERT_RTNL(). In virtio_net, netif_set_xps_queue is
called by virtnet_set_affinity. As this function can be called from an
ethtool helper, we can't take the rtnl lock directly in it. Instead we
take the rtnl lock around the calls to virtnet_set_affinity made from
paths where the rtnl lock isn't already held.

Signed-off-by: Antoine Tenart 
---
 drivers/net/virtio_net.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dde9bbcc5ff0..54d2277f6c98 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2027,7 +2027,9 @@ static int virtnet_cpu_online(unsigned int cpu, struct 
hlist_node *node)
 {
struct virtnet_info *vi = hlist_entry_safe(node, struct virtnet_info,
   node);
+   rtnl_lock();
virtnet_set_affinity(vi);
+   rtnl_unlock();
return 0;
 }
 
@@ -2035,7 +2037,9 @@ static int virtnet_cpu_dead(unsigned int cpu, struct 
hlist_node *node)
 {
struct virtnet_info *vi = hlist_entry_safe(node, struct virtnet_info,
   node_dead);
+   rtnl_lock();
virtnet_set_affinity(vi);
+   rtnl_unlock();
return 0;
 }
 
@@ -2883,7 +2887,9 @@ static int init_vqs(struct virtnet_info *vi)
goto err_free;
 
get_online_cpus();
+   rtnl_lock();
virtnet_set_affinity(vi);
+   rtnl_unlock();
put_online_cpus();
 
return 0;
-- 
2.29.2
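
For context, a rough sketch of the call paths the patch distinguishes
(call graph assumed from the driver sources, simplified):

	/*
	 * ethtool path: rtnl is already held by the ethtool core, so
	 * virtnet_set_affinity() cannot take it itself.
	 *
	 *     ethtool_set_channels()            [rtnl held]
	 *         -> virtnet_set_channels()
	 *             -> virtnet_set_affinity()
	 *
	 * CPU hotplug and probe paths: rtnl is not held, so the callers
	 * touched by this patch wrap the call.
	 *
	 *     virtnet_cpu_online() / virtnet_cpu_dead() / init_vqs()
	 *         -> rtnl_lock();
	 *            virtnet_set_affinity();
	 *            rtnl_unlock();
	 */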



[PATCH net-next v3 14/16] net: NULL the old xps map entries when freeing them

2021-03-12 Thread Antoine Tenart
In __netif_set_xps_queue, old map entries from the old dev_maps are
freed but their corresponding entries in the old dev_maps aren't NULLed.
Fix this.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 748e377c7fe3..4f1b38de00ac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2766,6 +2766,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
continue;
}
 
+   RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
kfree_rcu(map, rcu);
}
}
-- 
2.29.2



[PATCH net-next v3 13/16] net: fix use after free in xps

2021-03-12 Thread Antoine Tenart
When setting up a new dev_maps in __netif_set_xps_queue, we remove and
free maps from unused CPUs/rx-queues near the end of the function, by
calling remove_xps_queue. However it's possible those maps are also part
of the old, not-yet-freed dev_maps, which might be used concurrently.
When that happens, a map can be freed while its corresponding entry in
the old dev_maps table isn't NULLed, leading to:

  BUG: KASAN: use-after-free in xps_queue_show+0x469/0x480

This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL
the map entries from the old dev_maps table.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 052797ca65f6..748e377c7fe3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2460,7 +2460,7 @@ static DEFINE_MUTEX(xps_map_mutex);
rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
 
 static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
-int tci, u16 index)
+struct xps_dev_maps *old_maps, int tci, u16 index)
 {
struct xps_map *map = NULL;
int pos;
@@ -2479,6 +2479,8 @@ static bool remove_xps_queue(struct xps_dev_maps 
*dev_maps,
break;
}
 
+   if (old_maps)
+   RCU_INIT_POINTER(old_maps->attr_map[tci], NULL);
RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
kfree_rcu(map, rcu);
return false;
@@ -2499,7 +2501,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
int i, j;
 
for (i = count, j = offset; i--; j++) {
-   if (!remove_xps_queue(dev_maps, tci, j))
+   if (!remove_xps_queue(dev_maps, NULL, tci, j))
break;
}
 
@@ -2631,7 +2633,7 @@ static void xps_copy_dev_maps(struct xps_dev_maps 
*dev_maps,
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, enum xps_map_type type)
 {
-   struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   struct xps_dev_maps *dev_maps, *new_dev_maps = NULL, *old_dev_maps = 
NULL;
const unsigned long *online_mask = NULL;
bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
@@ -2768,7 +2770,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
}
}
 
-   kfree_rcu(dev_maps, rcu);
+   old_dev_maps = copy ? dev_maps : NULL;
 
 out_no_old_maps:
dev_maps = new_dev_maps;
@@ -2794,10 +2796,14 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
netif_attr_test_online(j, online_mask, 
dev_maps->nr_ids))
continue;
 
-   active |= remove_xps_queue(dev_maps, tci, index);
+   active |= remove_xps_queue(dev_maps, old_dev_maps, tci,
+  index);
}
}
 
+   if (old_dev_maps)
+   kfree_rcu(old_dev_maps, rcu);
+
/* free map if not active */
if (!active)
reset_xps_maps(dev, dev_maps, type);
-- 
2.29.2
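
A sketch of the sharing that causes the use-after-free fixed above (an
illustration, not code from the patch):

	/*
	 *     old dev_maps->attr_map[tci] ---+
	 *                                    +---> map  (shared after the copy)
	 *     new dev_maps->attr_map[tci] ---+
	 *
	 * Removing the queue from the new dev_maps used to free 'map'
	 * while the old dev_maps entry kept pointing at it; a concurrent
	 * reader walking the old dev_maps could then dereference freed
	 * memory, which is the KASAN splat quoted above.
	 */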



[PATCH net-next v3 08/16] net: move the xps maps to an array

2021-03-12 Thread Antoine Tenart
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify the code a lot, removing the need for
lots of if/else conditionals, as the correct map will be available using
its offset in the array.

This should not modify the xps maps behaviour in any way.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 drivers/net/virtio_net.c  |  2 +-
 include/linux/netdevice.h | 17 +
 net/core/dev.c| 73 +--
 net/core/net-sysfs.c  |  6 ++--
 4 files changed, 46 insertions(+), 52 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e97288dd6e5a..dde9bbcc5ff0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2015,7 +2015,7 @@ static void virtnet_set_affinity(struct virtnet_info *vi)
}
virtqueue_set_affinity(vi->rq[i].vq, mask);
virtqueue_set_affinity(vi->sq[i].vq, mask);
-   __netif_set_xps_queue(vi->dev, cpumask_bits(mask), i, false);
+   __netif_set_xps_queue(vi->dev, cpumask_bits(mask), i, XPS_CPUS);
cpumask_clear(mask);
}
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c9e056a0e2d..bcd15a2d3ddc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -754,6 +754,13 @@ struct rx_queue_attribute {
 const char *buf, size_t len);
 };
 
+/* XPS map type and offset of the xps map within net_device->xps_maps[]. */
+enum xps_map_type {
+   XPS_CPUS = 0,
+   XPS_RXQS,
+   XPS_MAPS_MAX,
+};
+
 #ifdef CONFIG_XPS
 /*
  * This structure holds an XPS map which can be of variable length.  The
@@ -1773,8 +1780,7 @@ enum netdev_ml_priv_type {
  * @tx_queue_len:  Max frames per queue allowed
  * @tx_global_lock:XXX: need comments on this one
  * @xdp_bulkq: XDP device bulk queue
- * @xps_cpus_map:  all CPUs map for XPS device
- * @xps_rxqs_map:  all RXQs map for XPS device
+ * @xps_maps:  all CPUs/RXQs maps for XPS device
  *
  * @xps_maps:  XXX: need comments on this one
  * @miniq_egress:  clsact qdisc specific data for
@@ -2070,8 +2076,7 @@ struct net_device {
struct xdp_dev_bulk_queue __percpu *xdp_bulkq;
 
 #ifdef CONFIG_XPS
-   struct xps_dev_maps __rcu *xps_cpus_map;
-   struct xps_dev_maps __rcu *xps_rxqs_map;
+   struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 #endif
 #ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
@@ -3701,7 +3706,7 @@ static inline void netif_wake_subqueue(struct net_device 
*dev, u16 queue_index)
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
- u16 index, bool is_rxqs_map);
+ u16 index, enum xps_map_type type);
 
 /**
  * netif_attr_test_mask - Test a CPU or Rx queue set in a mask
@@ -3796,7 +3801,7 @@ static inline int netif_set_xps_queue(struct net_device 
*dev,
 
 static inline int __netif_set_xps_queue(struct net_device *dev,
const unsigned long *mask,
-   u16 index, bool is_rxqs_map)
+   u16 index, enum xps_map_type type)
 {
return 0;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 241f440b306a..dfdd476a6d67 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2511,31 +2511,34 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 
 static void reset_xps_maps(struct net_device *dev,
   struct xps_dev_maps *dev_maps,
-  bool is_rxqs_map)
+  enum xps_map_type type)
 {
-   if (is_rxqs_map) {
-   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
-   RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
-   } else {
-   RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
-   }
static_key_slow_dec_cpuslocked(&xps_needed);
+   if (type == XPS_RXQS)
+   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
+
+   RCU_INIT_POINTER(dev->xps_maps[type], NULL);
+
kfree_rcu(dev_maps, rcu);
 }
 
-static void clean_xps_maps(struct net_device *dev,
-  struct xps_dev_maps *dev_maps, u16 offset, u16 count,
-  bool is_rxqs_map)
+static void clean_xps_maps(struct net_device *dev, enum xps_map_type type,
+  u16 offset, u16 count)
 {
+   struct xps_dev_maps *dev_maps;
bool active = false;
int i, j;
 
+   dev_maps = xmap_dereference(dev->xps_maps[type]);
+   if (!dev_maps)
+   return;
+
for (j = 0; j < dev_maps->nr_ids; j++)
 

[PATCH net-next v3 12/16] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-03-12 Thread Antoine Tenart
The xps_cpus_show and xps_rxqs_show functions share most of the same
logic. Having it duplicated in two functions does not help maintenance.
This patch moves their common logic into a new function, xps_queue_show,
to improve this.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 125 +--
 1 file changed, 48 insertions(+), 77 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 094fea082649..562a42fcd437 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1361,44 +1361,27 @@ static const struct attribute_group dql_group = {
 #endif /* CONFIG_BQL */
 
 #ifdef CONFIG_XPS
-static ssize_t xps_cpus_show(struct netdev_queue *queue,
-char *buf)
+static ssize_t xps_queue_show(struct net_device *dev, unsigned int index,
+ int tc, char *buf, enum xps_map_type type)
 {
-   struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned int index, nr_ids;
-   int j, len, ret, tc = 0;
unsigned long *mask;
-
-   if (!netif_is_multiqueue(dev))
-   return -ENOENT;
-
-   index = get_netdev_queue_index(queue);
-
-   if (!rtnl_trylock())
-   return restart_syscall();
-
-   /* If queue belongs to subordinate dev use its map */
-   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
-
-   tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0) {
-   rtnl_unlock();
-   return -EINVAL;
-   }
-
-   /* Make sure the subordinate device can't be freed */
-   get_device(&dev->dev);
-   rtnl_unlock();
+   unsigned int nr_ids;
+   int j, len;
 
rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_maps[XPS_CPUS]);
-   nr_ids = dev_maps ? dev_maps->nr_ids : nr_cpu_ids;
+   dev_maps = rcu_dereference(dev->xps_maps[type]);
+
+   /* Default to nr_cpu_ids/dev->num_rx_queues and do not just return 0
+* when dev_maps hasn't been allocated yet, to be backward compatible.
+*/
+   nr_ids = dev_maps ? dev_maps->nr_ids :
+(type == XPS_CPUS ? nr_cpu_ids : dev->num_rx_queues);
 
mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
if (!mask) {
-   ret = -ENOMEM;
-   goto err_rcu_unlock;
+   rcu_read_unlock();
+   return -ENOMEM;
}
 
if (!dev_maps || tc >= dev_maps->num_tc)
@@ -1421,16 +1404,44 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
 out_no_maps:
rcu_read_unlock();
-   put_device(&dev->dev);
 
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
+
return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
+{
+   struct net_device *dev = queue->dev;
+   unsigned int index;
+   int len, tc;
+
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+
+   if (!rtnl_trylock())
+   return restart_syscall();
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
+   tc = netdev_txq_to_tc(dev, index);
+   if (tc < 0) {
+   rtnl_unlock();
+   return -EINVAL;
+   }
+
+   /* Make sure the subordinate device can't be freed */
+   get_device(&dev->dev);
+   rtnl_unlock();
+
+   len = xps_queue_show(dev, index, tc, buf, XPS_CPUS);
 
-err_rcu_unlock:
-   rcu_read_unlock();
put_device(&dev->dev);
-   return ret;
+   return len;
 }
 
 static ssize_t xps_cpus_store(struct netdev_queue *queue,
@@ -1477,10 +1488,8 @@ static struct netdev_queue_attribute xps_cpus_attribute 
__ro_after_init
 static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
 {
struct net_device *dev = queue->dev;
-   struct xps_dev_maps *dev_maps;
-   unsigned int index, nr_ids;
-   int j, len, ret, tc = 0;
-   unsigned long *mask;
+   unsigned int index;
+   int tc;
 
index = get_netdev_queue_index(queue);
 
@@ -1492,45 +1501,7 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
if (tc < 0)
return -EINVAL;
 
-   rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_maps[XPS_RXQS]);
-   nr_ids = dev_maps ? dev_maps->nr_ids : dev->num_rx_queues;
-
-   mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
-   if (!mask) {
-   ret = -ENOMEM;
-   goto err_rcu_unlock;
-   }
-
-   if (!dev_maps || tc >= dev_maps->num_tc)
-   goto out_no_maps;
-
-   for (j = 0; j < nr_ids; j++) {
-   int i, tci = j * dev_maps->num_tc + tc;
-   

[PATCH net-next v3 09/16] net: add an helper to copy xps maps to the new dev_maps

2021-03-12 Thread Antoine Tenart
This patch adds a helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with
improved code readability and maintainability.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index dfdd476a6d67..4d39938417c4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2608,6 +2608,25 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
return new_map;
 }
 
+/* Copy xps maps at a given index */
+static void xps_copy_dev_maps(struct xps_dev_maps *dev_maps,
+ struct xps_dev_maps *new_dev_maps, int index,
+ int tc, bool skip_tc)
+{
+   int i, tci = index * dev_maps->num_tc;
+   struct xps_map *map;
+
+   /* copy maps belonging to foreign traffic classes */
+   for (i = 0; i < dev_maps->num_tc; i++, tci++) {
+   if (i == tc && skip_tc)
+   continue;
+
+   /* fill in the new device map from the old device map */
+   map = xmap_dereference(dev_maps->attr_map[tci]);
+   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
+   }
+}
+
 /* Must be called under rtnl_lock and cpus_read_lock */
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, enum xps_map_type type)
@@ -2696,23 +2715,16 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
}
 
for (j = 0; j < nr_ids; j++) {
-   /* copy maps belonging to foreign traffic classes */
-   for (i = tc, tci = j * num_tc; copy && i--; tci++) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
-   }
+   bool skip_tc = false;
 
-   /* We need to explicitly update tci as prevous loop
-* could break out early if dev_maps is NULL.
-*/
tci = j * num_tc + tc;
-
if (netif_attr_test_mask(j, mask, nr_ids) &&
netif_attr_test_online(j, online_mask, nr_ids)) {
/* add tx-queue to CPU/rx-queue maps */
int pos = 0;
 
+   skip_tc = true;
+
map = xmap_dereference(new_dev_maps->attr_map[tci]);
while ((pos < map->len) && (map->queues[pos] != index))
pos++;
@@ -2727,18 +2739,11 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
numa_node_id = -1;
}
 #endif
-   } else if (copy) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
 
-   /* copy maps belonging to foreign traffic classes */
-   for (i = num_tc - tc, tci++; copy && --i; tci++) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
-   }
+   if (copy)
+   xps_copy_dev_maps(dev_maps, new_dev_maps, j, tc,
+ skip_tc);
}
 
rcu_assign_pointer(dev->xps_maps[type], new_dev_maps);
-- 
2.29.2



[PATCH net-next v3 10/16] net: improve queue removal readability in __netif_set_xps_queue

2021-03-12 Thread Antoine Tenart
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4d39938417c4..052797ca65f6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2786,13 +2786,16 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 
/* removes tx-queue from unused CPUs/rx-queues */
for (j = 0; j < dev_maps->nr_ids; j++) {
-   for (i = tc, tci = j * dev_maps->num_tc; i--; tci++)
-   active |= remove_xps_queue(dev_maps, tci, index);
-   if (!netif_attr_test_mask(j, mask, dev_maps->nr_ids) ||
-   !netif_attr_test_online(j, online_mask, dev_maps->nr_ids))
-   active |= remove_xps_queue(dev_maps, tci, index);
-   for (i = dev_maps->num_tc - tc, tci++; --i; tci++)
+   tci = j * dev_maps->num_tc;
+
+   for (i = 0; i < dev_maps->num_tc; i++, tci++) {
+   if (i == tc &&
+   netif_attr_test_mask(j, mask, dev_maps->nr_ids) &&
+   netif_attr_test_online(j, online_mask, 
dev_maps->nr_ids))
+   continue;
+
active |= remove_xps_queue(dev_maps, tci, index);
+   }
}
 
/* free map if not active */
-- 
2.29.2



[PATCH net-next v3 11/16] net-sysfs: move the rtnl unlock up in the xps show helpers

2021-03-12 Thread Antoine Tenart
Now that nr_ids and num_tc are stored in the xps dev_maps, which are RCU
protected, we no longer need to protect the maps with the rtnl lock.
Move the rtnl unlock up to reduce the rtnl-locked section.

We also increase the reference count on the subordinate device, if any,
as we don't want this device to be freed while we use it (now that the
rtnl lock no longer protects it for the whole function).

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 25 +++--
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index ca1f3b63cfad..094fea082649 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1383,10 +1383,14 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
tc = netdev_txq_to_tc(dev, index);
if (tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
+   rtnl_unlock();
+   return -EINVAL;
}
 
+   /* Make sure the subordinate device can't be freed */
+   get_device(&dev->dev);
+   rtnl_unlock();
+
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps[XPS_CPUS]);
nr_ids = dev_maps ? dev_maps->nr_ids : nr_cpu_ids;
@@ -1417,8 +1421,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
 out_no_maps:
rcu_read_unlock();
-
-   rtnl_unlock();
+   put_device(&dev->dev);
 
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
@@ -1426,8 +1429,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
 err_rcu_unlock:
rcu_read_unlock();
-err_rtnl_unlock:
-   rtnl_unlock();
+   put_device(&dev->dev);
return ret;
 }
 
@@ -1486,10 +1488,9 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
return restart_syscall();
 
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
-   }
+   rtnl_unlock();
+   if (tc < 0)
+   return -EINVAL;
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps[XPS_RXQS]);
@@ -1522,8 +1523,6 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
 out_no_maps:
rcu_read_unlock();
 
-   rtnl_unlock();
-
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
 
@@ -1531,8 +1530,6 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
 
 err_rcu_unlock:
rcu_read_unlock();
-err_rtnl_unlock:
-   rtnl_unlock();
return ret;
 }
 
-- 
2.29.2
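
A condensed view of the resulting show-path locking (a sketch; the full
version is in the hunks above):

	if (!rtnl_trylock())
		return restart_syscall();

	tc = netdev_txq_to_tc(dev, index);	/* still needs rtnl */
	if (tc < 0) {
		rtnl_unlock();
		return -EINVAL;
	}

	get_device(&dev->dev);			/* keep the subordinate dev alive */
	rtnl_unlock();				/* from here the maps are RCU protected */

	rcu_read_lock();
	/* ... read dev->xps_maps[XPS_CPUS] ... */
	rcu_read_unlock();

	put_device(&dev->dev);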



[PATCH net-next v3 07/16] net: remove the xps possible_mask

2021-03-12 Thread Antoine Tenart
Remove the xps possible_mask. It was an optimization but we can just
loop from 0 to nr_ids now that nr_ids is embedded in the xps dev_maps.
That simplifies the code a bit.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 net/core/dev.c   | 40 +---
 net/core/net-sysfs.c |  4 ++--
 2 files changed, 15 insertions(+), 29 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 24d8f059e2a6..241f440b306a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2523,33 +2523,28 @@ static void reset_xps_maps(struct net_device *dev,
kfree_rcu(dev_maps, rcu);
 }
 
-static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
+static void clean_xps_maps(struct net_device *dev,
   struct xps_dev_maps *dev_maps, u16 offset, u16 count,
   bool is_rxqs_map)
 {
-   unsigned int nr_ids = dev_maps->nr_ids;
bool active = false;
int i, j;
 
-   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
-   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
-  count);
+   for (j = 0; j < dev_maps->nr_ids; j++)
+   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset, count);
if (!active)
reset_xps_maps(dev, dev_maps, is_rxqs_map);
 
if (!is_rxqs_map) {
-   for (i = offset + (count - 1); count--; i--) {
+   for (i = offset + (count - 1); count--; i--)
netdev_queue_numa_node_write(
-   netdev_get_tx_queue(dev, i),
-   NUMA_NO_NODE);
-   }
+   netdev_get_tx_queue(dev, i), NUMA_NO_NODE);
}
 }
 
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
   u16 count)
 {
-   const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
 
if (!static_key_false(&xps_needed))
@@ -2561,17 +2556,14 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
if (dev_maps)
-   clean_xps_maps(dev, possible_mask, dev_maps, offset,
-  count, true);
+   clean_xps_maps(dev, dev_maps, offset, count, true);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
if (!dev_maps)
goto out_no_maps;
 
-   if (num_possible_cpus() > 1)
-   possible_mask = cpumask_bits(cpu_possible_mask);
-   clean_xps_maps(dev, possible_mask, dev_maps, offset, count, false);
+   clean_xps_maps(dev, dev_maps, offset, count, false);
 
 out_no_maps:
mutex_unlock(&xps_map_mutex);
@@ -2627,8 +2619,8 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
 {
-   const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   const unsigned long *online_mask = NULL;
bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
@@ -2658,10 +2650,8 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
nr_ids = dev->num_rx_queues;
} else {
maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
-   if (num_possible_cpus() > 1) {
+   if (num_possible_cpus() > 1)
online_mask = cpumask_bits(cpu_online_mask);
-   possible_mask = cpumask_bits(cpu_possible_mask);
-   }
dev_maps = xmap_dereference(dev->xps_cpus_map);
nr_ids = nr_cpu_ids;
}
@@ -2712,8 +2702,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
static_key_slow_inc_cpuslocked(&xps_rxqs_needed);
}
 
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < nr_ids;) {
+   for (j = 0; j < nr_ids; j++) {
/* copy maps belonging to foreign traffic classes */
for (i = tc, tci = j * num_tc; copy && i--; tci++) {
/* fill in the new device map from the old device map */
@@ -2768,8 +2757,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (!dev_maps)
goto out_no_old_maps;
 
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < dev_maps->nr_ids;) {
+   for (j = 0; j < dev_maps->

[PATCH net-next v3 06/16] net: assert the rtnl lock is held when calling __netif_set_xps_queue

2021-03-12 Thread Antoine Tenart
Add ASSERT_RTNL at the top of __netif_set_xps_queue and add a comment
above the function stating the rtnl lock must be held.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 98a9c620f05a..24d8f059e2a6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2623,7 +2623,7 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
return new_map;
 }
 
-/* Must be called under cpus_read_lock */
+/* Must be called under rtnl_lock and cpus_read_lock */
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
 {
@@ -2635,6 +2635,8 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
struct xps_map *map, *new_map;
unsigned int nr_ids;
 
+   ASSERT_RTNL();
+
if (dev->num_tc) {
/* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
-- 
2.29.2



[PATCH net-next v3 05/16] net: embed nr_ids in the xps maps

2021-03-12 Thread Antoine Tenart
Embed nr_ids (the number of CPUs for the xps cpus map, and the number of
rxqs for the xps rxqs map) in dev_maps. That will help avoid accessing
out-of-bound memory if those values change after dev_maps was allocated.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  4 
 net/core/dev.c| 45 ++-
 net/core/net-sysfs.c  | 38 +++--
 3 files changed, 47 insertions(+), 40 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aa5c45198785..5c9e056a0e2d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -772,6 +772,9 @@ struct xps_map {
 /*
  * This structure holds all XPS maps for device.  Maps are indexed by CPU.
  *
+ * We keep track of the number of cpus/rxqs used when the struct is allocated,
+ * in nr_ids. This will help not accessing out-of-bound memory.
+ *
  * We keep track of the number of traffic classes used when the struct is
  * allocated, in num_tc. This will be used to navigate the maps, to ensure 
we're
  * not crossing its upper bound, as the original dev->num_tc can be updated in
@@ -779,6 +782,7 @@ struct xps_map {
  */
 struct xps_dev_maps {
struct rcu_head rcu;
+   unsigned int nr_ids;
s16 num_tc;
struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 3fa96d54654c..98a9c620f05a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2524,14 +2524,14 @@ static void reset_xps_maps(struct net_device *dev,
 }
 
 static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
-  struct xps_dev_maps *dev_maps, unsigned int nr_ids,
-  u16 offset, u16 count, bool is_rxqs_map)
+  struct xps_dev_maps *dev_maps, u16 offset, u16 count,
+  bool is_rxqs_map)
 {
+   unsigned int nr_ids = dev_maps->nr_ids;
bool active = false;
int i, j;
 
-   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids),
-j < nr_ids;)
+   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
   count);
if (!active)
@@ -2551,7 +2551,6 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 {
const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
-   unsigned int nr_ids;
 
if (!static_key_false(&xps_needed))
return;
@@ -2561,11 +2560,9 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
-   if (dev_maps) {
-   nr_ids = dev->num_rx_queues;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
-  offset, count, true);
-   }
+   if (dev_maps)
+   clean_xps_maps(dev, possible_mask, dev_maps, offset,
+  count, true);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2574,9 +2571,7 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
if (num_possible_cpus() > 1)
possible_mask = cpumask_bits(cpu_possible_mask);
-   nr_ids = nr_cpu_ids;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
-  false);
+   clean_xps_maps(dev, possible_mask, dev_maps, offset, count, false);
 
 out_no_maps:
mutex_unlock(&xps_map_mutex);
@@ -2673,11 +2668,12 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
maps_sz = L1_CACHE_BYTES;
 
/* The old dev_maps could be larger or smaller than the one we're
-* setting up now, as dev->num_tc could have been updated in between. We
-* could try to be smart, but let's be safe instead and only copy
-* foreign traffic classes if the two map sizes match.
+* setting up now, as dev->num_tc or nr_ids could have been updated in
+* between. We could try to be smart, but let's be safe instead and only
+* copy foreign traffic classes if the two map sizes match.
 */
-   if (dev_maps && dev_maps->num_tc == num_tc)
+   if (dev_maps &&
+   dev_maps->num_tc == num_tc && dev_maps->nr_ids == nr_ids)
copy = true;
 
/* allocate memory for queue storage */
@@ -2690,6 +2686,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
return -ENOMEM;
}
 
+   

[PATCH net-next v3 04/16] net: embed num_tc in the xps maps

2021-03-12 Thread Antoine Tenart
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to a
mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out-of-bound accesses can occur
(which can easily be seen using KASAN).

This patch aims at fixing this by embedding num_tc into the maps, using
the value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
  sure we won't have out of bound accesses.
- Checks can be made before accessing the maps so we know the values
  retrieved will make sense.

We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps into the new one, only if the number of traffic classes of both
maps matches.

Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  6 
 net/core/dev.c| 63 +--
 net/core/net-sysfs.c  | 45 +++-
 3 files changed, 64 insertions(+), 50 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b379d08a12ed..aa5c45198785 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -771,9 +771,15 @@ struct xps_map {
 
 /*
  * This structure holds all XPS maps for device.  Maps are indexed by CPU.
+ *
+ * We keep track of the number of traffic classes used when the struct is
+ * allocated, in num_tc. This will be used to navigate the maps, to ensure 
we're
+ * not crossing its upper bound, as the original dev->num_tc can be updated in
+ * the meantime.
  */
 struct xps_dev_maps {
struct rcu_head rcu;
+   s16 num_tc;
struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2bfdd528c7c3..3fa96d54654c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2491,7 +2491,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 struct xps_dev_maps *dev_maps,
 int cpu, u16 offset, u16 count)
 {
-   int num_tc = dev->num_tc ? : 1;
+   int num_tc = dev_maps->num_tc;
bool active = false;
int tci;
 
@@ -2634,10 +2634,10 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 {
const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
struct xps_map *map, *new_map;
-   bool active = false;
unsigned int nr_ids;
 
if (dev->num_tc) {
@@ -2672,19 +2672,29 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (maps_sz < L1_CACHE_BYTES)
maps_sz = L1_CACHE_BYTES;
 
+   /* The old dev_maps could be larger or smaller than the one we're
+* setting up now, as dev->num_tc could have been updated in between. We
+* could try to be smart, but let's be safe instead and only copy
+* foreign traffic classes if the two map sizes match.
+*/
+   if (dev_maps && dev_maps->num_tc == num_tc)
+   copy = true;
+
/* allocate memory for queue storage */
for (j = -1; j = netif_attrmask_next_and(j, online_mask, mask, nr_ids),
 j < nr_ids;) {
-   if (!new_dev_maps)
-   new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
if (!new_dev_maps) {
-   mutex_unlock(&xps_map_mutex);
-   return -ENOMEM;
+   new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
+   if (!new_dev_maps) {
+   mutex_unlock(&xps_map_mutex);
+   return -ENOMEM;
+   }
+
+   new_dev_maps->num_tc = num_tc;
}
 
tci = j * num_tc + tc;
-   map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
-NULL;
+   map = copy ? xmap_dereference(dev_maps->attr_map[tci]) : NULL;
 
map = expand_xps_map(map, j, index, is_rxqs_map);
if (!map)
@@ -2706,7 +2716,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
/* copy maps belonging to foreign traffic classes */
-   for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
+   for (i = tc, tci = j * num_tc; copy && i--; tci++) {
/* fill in the new device map from the old device map */
map

[PATCH net-next v3 02/16] net-sysfs: store the return of get_netdev_queue_index in an unsigned int

2021-03-12 Thread Antoine Tenart
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of
its callers use an unsigned long to store the returned value. Update the
code to be consistent; this should only be cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 3a083c0c9dd3..5dc4223f6b68 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1367,7 +1367,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1437,7 +1438,7 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
  const char *buf, size_t len)
 {
struct net_device *dev = queue->dev;
-   unsigned long index;
+   unsigned int index;
cpumask_var_t mask;
int err;
 
@@ -1479,7 +1480,8 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
index = get_netdev_queue_index(queue);
 
@@ -1541,7 +1543,8 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, 
const char *buf,
 {
struct net_device *dev = queue->dev;
struct net *net = dev_net(dev);
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
int err;
 
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
-- 
2.29.2



[PATCH net-next v3 01/16] net-sysfs: convert xps_cpus_show to bitmap_zalloc

2021-03-12 Thread Antoine Tenart
Use bitmap_zalloc instead of zalloc_cpumask_var in xps_cpus_show to
align with xps_rxqs_show. This will ease maintenance and allow us to
factor out the common logic of the two functions. The function should
behave the same.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 307628fdf380..3a083c0c9dd3 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1367,8 +1367,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   cpumask_var_t mask;
-   unsigned long index;
+   unsigned long *mask, index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1396,7 +1395,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
}
 
-   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
+   if (!mask) {
ret = -ENOMEM;
goto err_rtnl_unlock;
}
@@ -1414,7 +1414,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
for (i = map->len; i--;) {
if (map->queues[i] == index) {
-   cpumask_set_cpu(cpu, mask);
+   set_bit(cpu, mask);
break;
}
}
@@ -1424,8 +1424,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
rtnl_unlock();
 
-   len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask));
-   free_cpumask_var(mask);
+   len = bitmap_print_to_pagebuf(false, buf, mask, nr_cpu_ids);
+   bitmap_free(mask);
return len < PAGE_SIZE ? len : -EINVAL;
 
 err_rtnl_unlock:
-- 
2.29.2
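
For reference, the two allocation styles this patch aligns, side by side
(a sketch extracted from the hunks above, not a complete function):

	/* before: cpumask API */
	cpumask_var_t cpus;

	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
		return -ENOMEM;
	cpumask_set_cpu(cpu, cpus);
	len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(cpus));
	free_cpumask_var(cpus);

	/* after: generic bitmap API, as xps_rxqs_show already uses */
	unsigned long *mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);

	if (!mask)
		return -ENOMEM;
	set_bit(cpu, mask);
	len = bitmap_print_to_pagebuf(false, buf, mask, nr_cpu_ids);
	bitmap_free(mask);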



[PATCH net-next v3 03/16] net-sysfs: make xps_cpus_show and xps_rxqs_show consistent

2021-03-12 Thread Antoine Tenart
Make the implementations of xps_cpus_show and xps_rxqs_show converge,
as the two share the same logic but diverged over time. This should not
modify their behaviour but will help future changes and improve
maintenance.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5dc4223f6b68..5f76183ad5bc 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1364,7 +1364,7 @@ static const struct attribute_group dql_group = {
 static ssize_t xps_cpus_show(struct netdev_queue *queue,
 char *buf)
 {
-   int cpu, len, ret, num_tc = 1, tc = 0;
+   int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
unsigned long *mask;
@@ -1404,23 +1404,26 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_cpus_map);
-   if (dev_maps) {
-   for_each_possible_cpu(cpu) {
-   int i, tci = cpu * num_tc + tc;
-   struct xps_map *map;
-
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (!map)
-   continue;
-
-   for (i = map->len; i--;) {
-   if (map->queues[i] == index) {
-   set_bit(cpu, mask);
-   break;
-   }
+   if (!dev_maps)
+   goto out_no_maps;
+
+   for (j = -1; j = netif_attrmask_next(j, NULL, nr_cpu_ids),
+j < nr_cpu_ids;) {
+   int i, tci = j * num_tc + tc;
+   struct xps_map *map;
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (!map)
+   continue;
+
+   for (i = map->len; i--;) {
+   if (map->queues[i] == index) {
+   set_bit(j, mask);
+   break;
}
}
}
+out_no_maps:
rcu_read_unlock();
 
rtnl_unlock();
-- 
2.29.2



[PATCH net-next v3 00/16] net: xps: improve the xps maps handling

2021-03-12 Thread Antoine Tenart
Hello,

This series aims at fixing various issues with the xps code, including
out-of-bound accesses and use-after-free. While doing so we try to
improve the xps code maintainability and readability.

The main change is moving dev->num_tc and nr_ids (the number of CPUs or
rx queues) into the xps maps, to avoid out-of-bound accesses as those two
values can be updated after the maps have been allocated. This allows
further reworks, to improve the xps code readability and to stop taking
the rtnl lock when reading the maps in sysfs. The maps are moved to an
array in net_device,
which simplifies the code a lot.
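
Roughly, the resulting layout (as introduced by patches 4, 5 and 8):

	/* Both values are captured when the map is allocated, so readers
	 * can no longer run past the map if dev->num_tc or the number of
	 * CPUs/rx queues changes afterwards.
	 */
	struct xps_dev_maps {
		struct rcu_head rcu;
		unsigned int nr_ids;
		s16 num_tc;
		struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
	};

	/* In struct net_device, the two map pointers become one array: */
	struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX]; /* XPS_CPUS, XPS_RXQS */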

Note that patch 6 adds a check to assert the rtnl lock is taken when
calling netif_set_xps_queue. Two drivers (mlx5 and virtio_net) are fixed
in this regard, but we might see new warnings in the near future because
of this. This is expected and shouldn't be an issue (it's only a
warning, and the fix should be fairly easy to do).

One future improvement may be to remove the use of xps_map_mutex from
net/core/dev.c, but that may require extra care.

Thanks!
Antoine

Since v2:
  - Patches 13-16 are new to the series.
  - Fixed another issue I found while preparing v3 (use after free of
old xps maps).
  - Kept the rtnl lock when calling netdev_get_tx_queue and
netdev_txq_to_tc.
  - Use get_device/put_device when using the sb_dev.
  - Take the rtnl lock in mlx5 and virtio_net when calling
netif_set_xps_queue.
  - Fixed a coding style issue.

Since v1:
  - Reordered the patches to improve readability and avoid introducing
issues in between patches.
  - Use dev_maps->nr_ids to allocate the mask in xps_queue_show but
still default to nr_cpu_ids/dev->num_rx_queues in xps_queue_show
when dev_maps hasn't been allocated yet for backward compatibility.

Antoine Tenart (16):
  net-sysfs: convert xps_cpus_show to bitmap_zalloc
  net-sysfs: store the return of get_netdev_queue_index in an unsigned
int
  net-sysfs: make xps_cpus_show and xps_rxqs_show consistent
  net: embed num_tc in the xps maps
  net: embed nr_ids in the xps maps
  net: assert the rtnl lock is held when calling __netif_set_xps_queue
  net: remove the xps possible_mask
  net: move the xps maps to an array
  net: add an helper to copy xps maps to the new dev_maps
  net: improve queue removal readability in __netif_set_xps_queue
  net-sysfs: move the rtnl unlock up in the xps show helpers
  net-sysfs: move the xps cpus/rxqs retrieval in a common function
  net: fix use after free in xps
  net: NULL the old xps map entries when freeing them
  net/mlx5e: take the rtnl lock when calling netif_set_xps_queue
  virtio_net: take the rtnl lock when calling virtnet_set_affinity

 .../net/ethernet/mellanox/mlx5/core/en_main.c |  11 +-
 drivers/net/virtio_net.c  |   8 +-
 include/linux/netdevice.h |  27 +-
 net/core/dev.c| 250 +-
 net/core/net-sysfs.c  | 177 +
 5 files changed, 233 insertions(+), 240 deletions(-)

-- 
2.29.2



Re: [PATCH net-next v2 09/12] net-sysfs: remove the rtnl lock when accessing the xps maps

2021-02-09 Thread Antoine Tenart
Quoting Antoine Tenart (2021-02-09 10:12:11)
> 
> As the sb_dev pointer is protected by the rtnl lock, we'll have to keep
> the lock. I'll move that patch to the end of the series (it'll be easier
> to add the get_device/put_device logic with the xps_queue_show
> function). I'll also move netdev_txq_to_tc out of xps_queue_show, to
> call it with the rtnl lock held.

That was not very clear. I meant I won't remove the rtnl lock, but will
try not to take it for too long, using get_device/put_device. We'll see
if I'll have a dedicated patch for that or not.


Re: [PATCH net-next v2 09/12] net-sysfs: remove the rtnl lock when accessing the xps maps

2021-02-09 Thread Antoine Tenart
Quoting Alexander Duyck (2021-02-08 23:20:58)
> On Mon, Feb 8, 2021 at 9:19 AM Antoine Tenart  wrote:
> > @@ -1328,17 +1328,12 @@ static ssize_t xps_cpus_show(struct netdev_queue 
> > *queue,
> >
> > index = get_netdev_queue_index(queue);
> >
> > -   if (!rtnl_trylock())
> > -   return restart_syscall();
> > -
> > /* If queue belongs to subordinate dev use its map */
> > dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
> >
> > tc = netdev_txq_to_tc(dev, index);
> > -   if (tc < 0) {
> > -   ret = -EINVAL;
> > -   goto err_rtnl_unlock;
> > -   }
> > +   if (tc < 0)
> > +   return -EINVAL;
> >
> > rcu_read_lock();
> > dev_maps = rcu_dereference(dev->xps_maps[XPS_CPUS]);
> 
> So I think we hit a snag here. The sb_dev pointer is protected by the
> rtnl_lock. So I don't think we can release the rtnl_lock until after
> we are done with the dev pointer.
> 
> Also I am not sure it is safe to use netdev_txq_to_tc without holding
> the lock. I don't know if we ever went through and guaranteed that it
> will always work if the lock isn't held since in theory the device
> could reprogram all the map values out from under us.
> 
> Odds are we should probably fix traffic_class_show as I suspect it
> probably also needs to be holding the rtnl_lock to prevent any
> possible races. I'll submit a patch for that.

Yet another possible race :-) Good catch, I thought about the one we
fixed already but not this one.

As the sb_dev pointer is protected by the rtnl lock, we'll have to keep
the lock. I'll move that patch to the end of the series (it'll be easier
to add the get_device/put_device logic with the xps_queue_show
function). I'll also move netdev_txq_to_tc out of xps_queue_show, to
call it with the rtnl lock held.

Thanks,
Antoine


Re: [PATCH net-next v2 07/12] net: remove the xps possible_mask

2021-02-09 Thread Antoine Tenart
Quoting Alexander Duyck (2021-02-08 22:43:39)
> On Mon, Feb 8, 2021 at 9:19 AM Antoine Tenart  wrote:
> >
> > -static void clean_xps_maps(struct net_device *dev, const unsigned long 
> > *mask,
> > +static void clean_xps_maps(struct net_device *dev,
> >struct xps_dev_maps *dev_maps, u16 offset, u16 
> > count,
> >bool is_rxqs_map)
> >  {
> > -   unsigned int nr_ids = dev_maps->nr_ids;
> > bool active = false;
> > int i, j;
> >
> > -   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
> > -   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
> > -  count);
> > +   for (j = 0; j < dev_maps->nr_ids; j++)
> > +   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset, 
> > count);
> > if (!active)
> > reset_xps_maps(dev, dev_maps, is_rxqs_map);
> >
> > -   if (!is_rxqs_map) {
> > -   for (i = offset + (count - 1); count--; i--) {
> > +   if (!is_rxqs_map)
> > +   for (i = offset + (count - 1); count--; i--)
> > netdev_queue_numa_node_write(
> > -   netdev_get_tx_queue(dev, i),
> > -   NUMA_NO_NODE);
> > -   }
> > -   }
> > +   netdev_get_tx_queue(dev, i), NUMA_NO_NODE);
> >  }
> 
> This violates the coding-style guide for the kernel. The if statement
> should still have braces as the for loop and
> netdev_queue_numa_node_write are more than a single statement. I'd be
> curious to see if checkpatch also complains about this because it
> probably should.

You're right, I'll remove that change to comply with the coding style.

I reran checkpatch, even with --strict, and it did not complain. Maybe
because it's a rework, not strictly new code.

Thanks,
Antoine
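
For reference, the shape that keeps the braces around the multi-line body
(this is how the hunk ends up looking in v3):

	if (!is_rxqs_map) {
		for (i = offset + (count - 1); count--; i--)
			netdev_queue_numa_node_write(
				netdev_get_tx_queue(dev, i), NUMA_NO_NODE);
	}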


[PATCH net-next v2 12/12] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-02-08 Thread Antoine Tenart
The xps_cpus_show and xps_rxqs_show functions share most of the same
logic. Having it duplicated in two functions does not help maintenance.
This patch moves their common logic into a new function, xps_queue_show,
to improve this.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 98 ++--
 1 file changed, 31 insertions(+), 67 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 6ce5772e799e..984c15248483 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1314,35 +1314,31 @@ static const struct attribute_group dql_group = {
 #endif /* CONFIG_BQL */
 
 #ifdef CONFIG_XPS
-static ssize_t xps_cpus_show(struct netdev_queue *queue,
-char *buf)
+static ssize_t xps_queue_show(struct net_device *dev, unsigned int index,
+ char *buf, enum xps_map_type type)
 {
-   struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned int index, nr_ids;
-   int j, len, ret, tc = 0;
unsigned long *mask;
-
-   if (!netif_is_multiqueue(dev))
-   return -ENOENT;
-
-   index = get_netdev_queue_index(queue);
-
-   /* If queue belongs to subordinate dev use its map */
-   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+   unsigned int nr_ids;
+   int j, len, tc = 0;
 
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
 
rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_maps[XPS_CPUS]);
-   nr_ids = dev_maps ? dev_maps->nr_ids : nr_cpu_ids;
+   dev_maps = rcu_dereference(dev->xps_maps[type]);
+
+   /* Default to nr_cpu_ids/dev->num_rx_queues and do not just return 0
+* when dev_maps hasn't been allocated yet, to be backward compatible.
+*/
+   nr_ids = dev_maps ? dev_maps->nr_ids :
+(type == XPS_CPUS ? nr_cpu_ids : dev->num_rx_queues);
 
mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
if (!mask) {
-   ret = -ENOMEM;
-   goto err_rcu_unlock;
+   rcu_read_unlock();
+   return -ENOMEM;
}
 
if (!dev_maps || tc >= dev_maps->num_tc)
@@ -1368,11 +1364,24 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
+
return len < PAGE_SIZE ? len : -EINVAL;
+}
 
-err_rcu_unlock:
-   rcu_read_unlock();
-   return ret;
+static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
+{
+   struct net_device *dev = queue->dev;
+   unsigned int index;
+
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
+   return xps_queue_show(dev, index, buf, XPS_CPUS);
 }
 
 static ssize_t xps_cpus_store(struct netdev_queue *queue,
@@ -1419,56 +1428,11 @@ static struct netdev_queue_attribute xps_cpus_attribute 
__ro_after_init
 static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
 {
struct net_device *dev = queue->dev;
-   struct xps_dev_maps *dev_maps;
-   unsigned int index, nr_ids;
-   int j, len, ret, tc = 0;
-   unsigned long *mask;
+   unsigned int index;
 
index = get_netdev_queue_index(queue);
 
-   tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0)
-   return -EINVAL;
-
-   rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_maps[XPS_RXQS]);
-   nr_ids = dev_maps ? dev_maps->nr_ids : dev->num_rx_queues;
-
-   mask = bitmap_zalloc(nr_ids, GFP_KERNEL);
-   if (!mask) {
-   ret = -ENOMEM;
-   goto err_rcu_unlock;
-   }
-
-   if (!dev_maps || tc >= dev_maps->num_tc)
-   goto out_no_maps;
-
-   for (j = 0; j < nr_ids; j++) {
-   int i, tci = j * dev_maps->num_tc + tc;
-   struct xps_map *map;
-
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (!map)
-   continue;
-
-   for (i = map->len; i--;) {
-   if (map->queues[i] == index) {
-   set_bit(j, mask);
-   break;
-   }
-   }
-   }
-out_no_maps:
-   rcu_read_unlock();
-
-   len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
-   bitmap_free(mask);
-
-   return len < PAGE_SIZE ? len : -EINVAL;
-
-err_rcu_unlock:
-   rcu_read_unlock();
-   return ret;
+   return xps_queue_show(dev, index, buf, XPS_RXQS);
 }
 
 static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
-- 
2.29.2



[PATCH net-next v2 10/12] net: add an helper to copy xps maps to the new dev_maps

2021-02-08 Thread Antoine Tenart
This patch adds a helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with
improved code readability and maintainability.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6a2f827beca1..9b91e0d0895c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2589,6 +2589,25 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
return new_map;
 }
 
+/* Copy xps maps at a given index */
+static void xps_copy_dev_maps(struct xps_dev_maps *dev_maps,
+ struct xps_dev_maps *new_dev_maps, int index,
+ int tc, bool skip_tc)
+{
+   int i, tci = index * dev_maps->num_tc;
+   struct xps_map *map;
+
+   /* copy maps belonging to foreign traffic classes */
+   for (i = 0; i < dev_maps->num_tc; i++, tci++) {
+   if (i == tc && skip_tc)
+   continue;
+
+   /* fill in the new device map from the old device map */
+   map = xmap_dereference(dev_maps->attr_map[tci]);
+   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
+   }
+}
+
 /* Must be called under rtnl_lock and cpus_read_lock */
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, enum xps_map_type type)
@@ -2676,23 +2695,16 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
}
 
for (j = 0; j < nr_ids; j++) {
-   /* copy maps belonging to foreign traffic classes */
-   for (i = tc, tci = j * num_tc; copy && i--; tci++) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
-   }
+   bool skip_tc = false;
 
-   /* We need to explicitly update tci as prevous loop
-* could break out early if dev_maps is NULL.
-*/
tci = j * num_tc + tc;
-
if (netif_attr_test_mask(j, mask, nr_ids) &&
netif_attr_test_online(j, online_mask, nr_ids)) {
/* add tx-queue to CPU/rx-queue maps */
int pos = 0;
 
+   skip_tc = true;
+
map = xmap_dereference(new_dev_maps->attr_map[tci]);
while ((pos < map->len) && (map->queues[pos] != index))
pos++;
@@ -2707,18 +2719,11 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
numa_node_id = -1;
}
 #endif
-   } else if (copy) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
 
-   /* copy maps belonging to foreign traffic classes */
-   for (i = num_tc - tc, tci++; copy && --i; tci++) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
-   }
+   if (copy)
+   xps_copy_dev_maps(dev_maps, new_dev_maps, j, tc,
+ skip_tc);
}
 
rcu_assign_pointer(dev->xps_maps[type], new_dev_maps);
-- 
2.29.2



[PATCH net-next v2 11/12] net: improve queue removal readability in __netif_set_xps_queue

2021-02-08 Thread Antoine Tenart
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 9b91e0d0895c..7c3ac6736bb6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2766,13 +2766,16 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 
/* removes tx-queue from unused CPUs/rx-queues */
for (j = 0; j < dev_maps->nr_ids; j++) {
-   for (i = tc, tci = j * dev_maps->num_tc; i--; tci++)
-   active |= remove_xps_queue(dev_maps, tci, index);
-   if (!netif_attr_test_mask(j, mask, dev_maps->nr_ids) ||
-   !netif_attr_test_online(j, online_mask, dev_maps->nr_ids))
-   active |= remove_xps_queue(dev_maps, tci, index);
-   for (i = dev_maps->num_tc - tc, tci++; --i; tci++)
+   tci = j * dev_maps->num_tc;
+
+   for (i = 0; i < dev_maps->num_tc; i++, tci++) {
+   if (i == tc &&
+   netif_attr_test_mask(j, mask, dev_maps->nr_ids) &&
+   netif_attr_test_online(j, online_mask, 
dev_maps->nr_ids))
+   continue;
+
active |= remove_xps_queue(dev_maps, tci, index);
+   }
}
 
/* free map if not active */
-- 
2.29.2



[PATCH net-next v2 04/12] net: embed num_tc in the xps maps

2021-02-08 Thread Antoine Tenart
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to a
mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out-of-bound accesses can occur
(as can easily be seen using KASAN).

This patch fixes this by embedding num_tc into the maps, using the
value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
  sure we won't have out-of-bound accesses.
- Checks can be made before accessing the maps, so we know the values
  retrieved will make sense.

We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps into the new one, only if the number of traffic classes of both
maps matches.
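
As an illustration only (none of this code is in the patch and the
helper name is made up), a reader-side lookup bounded by the embedded
value could look like the sketch below, assuming it runs under
rcu_read_lock():

/* Index attr_map[] with the num_tc stored in the map itself, so a later
 * change of dev->num_tc cannot push the index past the allocated array.
 */
static int xps_lookup_queue(struct xps_dev_maps *dev_maps, int id, int tc)
{
        struct xps_map *map;

        if (tc >= dev_maps->num_tc)
                return -1;

        map = rcu_dereference(dev_maps->attr_map[id * dev_maps->num_tc + tc]);
        if (!map || !map->len)
                return -1;

        return map->queues[0];
}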

Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  6 
 net/core/dev.c| 63 +--
 net/core/net-sysfs.c  | 45 +++-
 3 files changed, 64 insertions(+), 50 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e9e7ada07ea1..d7d3c646d40d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -779,9 +779,15 @@ struct xps_map {
 
 /*
  * This structure holds all XPS maps for device.  Maps are indexed by CPU.
+ *
+ * We keep track of the number of traffic classes used when the struct is
+ * allocated, in num_tc. This will be used to navigate the maps, to ensure 
we're
+ * not crossing its upper bound, as the original dev->num_tc can be updated in
+ * the meantime.
  */
 struct xps_dev_maps {
struct rcu_head rcu;
+   s16 num_tc;
struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 21d74d30f5d7..7c5e2c614723 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2473,7 +2473,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 struct xps_dev_maps *dev_maps,
 int cpu, u16 offset, u16 count)
 {
-   int num_tc = dev->num_tc ? : 1;
+   int num_tc = dev_maps->num_tc;
bool active = false;
int tci;
 
@@ -2616,10 +2616,10 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 {
const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
struct xps_map *map, *new_map;
-   bool active = false;
unsigned int nr_ids;
 
if (dev->num_tc) {
@@ -2654,19 +2654,29 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (maps_sz < L1_CACHE_BYTES)
maps_sz = L1_CACHE_BYTES;
 
+   /* The old dev_maps could be larger or smaller than the one we're
+* setting up now, as dev->num_tc could have been updated in between. We
+* could try to be smart, but let's be safe instead and only copy
+* foreign traffic classes if the two map sizes match.
+*/
+   if (dev_maps && dev_maps->num_tc == num_tc)
+   copy = true;
+
/* allocate memory for queue storage */
for (j = -1; j = netif_attrmask_next_and(j, online_mask, mask, nr_ids),
 j < nr_ids;) {
-   if (!new_dev_maps)
-   new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
if (!new_dev_maps) {
-   mutex_unlock(&xps_map_mutex);
-   return -ENOMEM;
+   new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
+   if (!new_dev_maps) {
+   mutex_unlock(&xps_map_mutex);
+   return -ENOMEM;
+   }
+
+   new_dev_maps->num_tc = num_tc;
}
 
tci = j * num_tc + tc;
-   map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
-NULL;
+   map = copy ? xmap_dereference(dev_maps->attr_map[tci]) : NULL;
 
map = expand_xps_map(map, j, index, is_rxqs_map);
if (!map)
@@ -2688,7 +2698,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
/* copy maps belonging to foreign traffic classes */
-   for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
+   for (i = tc, tci = j * num_tc; copy && i--; tci++) {
/* fill in the new device map from the old device map */
map

[PATCH net-next v2 07/12] net: remove the xps possible_mask

2021-02-08 Thread Antoine Tenart
Remove the xps possible_mask. It was an optimization, but we can just
loop from 0 to nr_ids now that nr_ids is embedded in the xps dev_maps.
That simplifies the code a bit.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 net/core/dev.c   | 43 ++-
 net/core/net-sysfs.c |  4 ++--
 2 files changed, 16 insertions(+), 31 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index abbb2ae6b3ed..d0c07ccea2e5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2505,33 +2505,27 @@ static void reset_xps_maps(struct net_device *dev,
kfree_rcu(dev_maps, rcu);
 }
 
-static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
+static void clean_xps_maps(struct net_device *dev,
   struct xps_dev_maps *dev_maps, u16 offset, u16 count,
   bool is_rxqs_map)
 {
-   unsigned int nr_ids = dev_maps->nr_ids;
bool active = false;
int i, j;
 
-   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
-   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
-  count);
+   for (j = 0; j < dev_maps->nr_ids; j++)
+   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset, count);
if (!active)
reset_xps_maps(dev, dev_maps, is_rxqs_map);
 
-   if (!is_rxqs_map) {
-   for (i = offset + (count - 1); count--; i--) {
+   if (!is_rxqs_map)
+   for (i = offset + (count - 1); count--; i--)
netdev_queue_numa_node_write(
-   netdev_get_tx_queue(dev, i),
-   NUMA_NO_NODE);
-   }
-   }
+   netdev_get_tx_queue(dev, i), NUMA_NO_NODE);
 }
 
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
   u16 count)
 {
-   const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
 
if (!static_key_false(&xps_needed))
@@ -2543,17 +2537,14 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
if (dev_maps)
-   clean_xps_maps(dev, possible_mask, dev_maps, offset,
-  count, true);
+   clean_xps_maps(dev, dev_maps, offset, count, true);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
if (!dev_maps)
goto out_no_maps;
 
-   if (num_possible_cpus() > 1)
-   possible_mask = cpumask_bits(cpu_possible_mask);
-   clean_xps_maps(dev, possible_mask, dev_maps, offset, count, false);
+   clean_xps_maps(dev, dev_maps, offset, count, false);
 
 out_no_maps:
mutex_unlock(&xps_map_mutex);
@@ -2609,8 +2600,8 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
 {
-   const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   const unsigned long *online_mask = NULL;
bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
@@ -2640,10 +2631,8 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
nr_ids = dev->num_rx_queues;
} else {
maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
-   if (num_possible_cpus() > 1) {
+   if (num_possible_cpus() > 1)
online_mask = cpumask_bits(cpu_online_mask);
-   possible_mask = cpumask_bits(cpu_possible_mask);
-   }
dev_maps = xmap_dereference(dev->xps_cpus_map);
nr_ids = nr_cpu_ids;
}
@@ -2693,8 +2682,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
static_key_slow_inc_cpuslocked(&xps_rxqs_needed);
}
 
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < nr_ids;) {
+   for (j = 0; j < nr_ids; j++) {
/* copy maps belonging to foreign traffic classes */
for (i = tc, tci = j * num_tc; copy && i--; tci++) {
/* fill in the new device map from the old device map */
@@ -2749,8 +2737,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (!dev_maps)
goto out_no_old_maps;
 
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < nr_ids;) {
for (j = 0; j <

[PATCH net-next v2 08/12] net: move the xps maps to an array

2021-02-08 Thread Antoine Tenart
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That simplifies the code a lot, removing the need for many
if/else conditionals, as the correct map is available using its
offset in the array.

This should not modify the xps maps behaviour in any way.
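
As a sketch of the pattern (the helper below is hypothetical, not part
of the patch), indexing by map type lets one code path serve both maps:

/* Check whether a map of the given type is installed; XPS_CPUS and
 * XPS_RXQS select the entry in dev->xps_maps[], so no is_rxqs_map
 * branching is needed.
 */
static bool xps_map_present(struct net_device *dev, enum xps_map_type type)
{
        struct xps_dev_maps *dev_maps;
        bool present;

        rcu_read_lock();
        dev_maps = rcu_dereference(dev->xps_maps[type]);
        present = !!dev_maps;
        rcu_read_unlock();

        return present;
}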

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 drivers/net/virtio_net.c  |  2 +-
 include/linux/netdevice.h | 17 +
 net/core/dev.c| 73 +--
 net/core/net-sysfs.c  |  6 ++--
 4 files changed, 46 insertions(+), 52 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e63792549..1c98ef44c6a1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1980,7 +1980,7 @@ static void virtnet_set_affinity(struct virtnet_info *vi)
}
virtqueue_set_affinity(vi->rq[i].vq, mask);
virtqueue_set_affinity(vi->sq[i].vq, mask);
-   __netif_set_xps_queue(vi->dev, cpumask_bits(mask), i, false);
+   __netif_set_xps_queue(vi->dev, cpumask_bits(mask), i, XPS_CPUS);
cpumask_clear(mask);
}
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 40683b6eee54..e868ce03db89 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -762,6 +762,13 @@ struct rx_queue_attribute {
 const char *buf, size_t len);
 };
 
+/* XPS map type and offset of the xps map within net_device->xps_maps[]. */
+enum xps_map_type {
+   XPS_CPUS = 0,
+   XPS_RXQS,
+   XPS_MAPS_MAX,
+};
+
 #ifdef CONFIG_XPS
 /*
  * This structure holds an XPS map which can be of variable length.  The
@@ -1770,8 +1777,7 @@ enum netdev_priv_flags {
  * @tx_queue_len:  Max frames per queue allowed
  * @tx_global_lock:XXX: need comments on this one
  * @xdp_bulkq: XDP device bulk queue
- * @xps_cpus_map:  all CPUs map for XPS device
- * @xps_rxqs_map:  all RXQs map for XPS device
+ * @xps_maps:  all CPUs/RXQs maps for XPS device
  *
  * @xps_maps:  XXX: need comments on this one
  * @miniq_egress:  clsact qdisc specific data for
@@ -2064,8 +2070,7 @@ struct net_device {
struct xdp_dev_bulk_queue __percpu *xdp_bulkq;
 
 #ifdef CONFIG_XPS
-   struct xps_dev_maps __rcu *xps_cpus_map;
-   struct xps_dev_maps __rcu *xps_rxqs_map;
+   struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 #endif
 #ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
@@ -3669,7 +3674,7 @@ static inline void netif_wake_subqueue(struct net_device 
*dev, u16 queue_index)
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
- u16 index, bool is_rxqs_map);
+ u16 index, enum xps_map_type type);
 
 /**
  * netif_attr_test_mask - Test a CPU or Rx queue set in a mask
@@ -3764,7 +3769,7 @@ static inline int netif_set_xps_queue(struct net_device 
*dev,
 
 static inline int __netif_set_xps_queue(struct net_device *dev,
const unsigned long *mask,
-   u16 index, bool is_rxqs_map)
+   u16 index, enum xps_map_type type)
 {
return 0;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index d0c07ccea2e5..6a2f827beca1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2493,31 +2493,34 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 
 static void reset_xps_maps(struct net_device *dev,
   struct xps_dev_maps *dev_maps,
-  bool is_rxqs_map)
+  enum xps_map_type type)
 {
-   if (is_rxqs_map) {
-   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
-   RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
-   } else {
-   RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
-   }
static_key_slow_dec_cpuslocked(&xps_needed);
+   if (type == XPS_RXQS)
+   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
+
+   RCU_INIT_POINTER(dev->xps_maps[type], NULL);
+
kfree_rcu(dev_maps, rcu);
 }
 
-static void clean_xps_maps(struct net_device *dev,
-  struct xps_dev_maps *dev_maps, u16 offset, u16 count,
-  bool is_rxqs_map)
+static void clean_xps_maps(struct net_device *dev, enum xps_map_type type,
+  u16 offset, u16 count)
 {
+   struct xps_dev_maps *dev_maps;
bool active = false;
int i, j;
 
+   dev_maps = xmap_dereference(dev->xps_maps[type]);
+   if (!dev_maps)
+   return;
+
for (j = 0; j < dev_maps->nr_ids; j++)
 

[PATCH net-next v2 05/12] net: embed nr_ids in the xps maps

2021-02-08 Thread Antoine Tenart
Embed nr_ids (the number of CPUs for the xps cpus map, and the number of
rxqs for the xps rxqs map) in dev_maps. That helps avoid accessing
out-of-bound memory if those values change after dev_maps was allocated.
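
As a hedged illustration (a fragment, not taken verbatim from this
patch), a guard in the transmit path can then rely only on the values
the map was allocated with:

        /* Both bounds come from the map instance itself, not from
         * dev->num_tc or nr_cpu_ids/dev->num_rx_queues, which may have
         * changed since the map was allocated.
         */
        if (tc >= dev_maps->num_tc || tci >= dev_maps->nr_ids)
                return -1;      /* no XPS hint for this packet */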

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  4 
 net/core/dev.c| 34 +++---
 net/core/net-sysfs.c  | 38 ++
 3 files changed, 41 insertions(+), 35 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d7d3c646d40d..40683b6eee54 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -780,6 +780,9 @@ struct xps_map {
 /*
  * This structure holds all XPS maps for device.  Maps are indexed by CPU.
  *
+ * We keep track of the number of cpus/rxqs used when the struct is allocated,
+ * in nr_ids. This will help not accessing out-of-bound memory.
+ *
  * We keep track of the number of traffic classes used when the struct is
  * allocated, in num_tc. This will be used to navigate the maps, to ensure 
we're
  * not crossing its upper bound, as the original dev->num_tc can be updated in
@@ -787,6 +790,7 @@ struct xps_map {
  */
 struct xps_dev_maps {
struct rcu_head rcu;
+   unsigned int nr_ids;
s16 num_tc;
struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 7c5e2c614723..1f7df0bd415c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2506,14 +2506,14 @@ static void reset_xps_maps(struct net_device *dev,
 }
 
 static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
-  struct xps_dev_maps *dev_maps, unsigned int nr_ids,
-  u16 offset, u16 count, bool is_rxqs_map)
+  struct xps_dev_maps *dev_maps, u16 offset, u16 count,
+  bool is_rxqs_map)
 {
+   unsigned int nr_ids = dev_maps->nr_ids;
bool active = false;
int i, j;
 
-   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids),
-j < nr_ids;)
+   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
   count);
if (!active)
@@ -2533,7 +2533,6 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 {
const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
-   unsigned int nr_ids;
 
if (!static_key_false(&xps_needed))
return;
@@ -2543,11 +2542,9 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
-   if (dev_maps) {
-   nr_ids = dev->num_rx_queues;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
-  offset, count, true);
-   }
+   if (dev_maps)
+   clean_xps_maps(dev, possible_mask, dev_maps, offset,
+  count, true);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2556,9 +2553,7 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
if (num_possible_cpus() > 1)
possible_mask = cpumask_bits(cpu_possible_mask);
-   nr_ids = nr_cpu_ids;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
-  false);
+   clean_xps_maps(dev, possible_mask, dev_maps, offset, count, false);
 
 out_no_maps:
mutex_unlock(&xps_map_mutex);
@@ -2672,6 +2667,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
return -ENOMEM;
}
 
+   new_dev_maps->nr_ids = nr_ids;
new_dev_maps->num_tc = num_tc;
}
 
@@ -2786,12 +2782,12 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
goto out_no_maps;
 
/* removes tx-queue from unused CPUs/rx-queues */
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < nr_ids;) {
+   for (j = -1; j = netif_attrmask_next(j, possible_mask, 
dev_maps->nr_ids),
+j < dev_maps->nr_ids;) {
for (i = tc, tci = j * dev_maps->num_tc; i--; tci++)
active |= remove_xps_queue(dev_maps, tci, index);
-   if (!netif_attr_test_mask(j, mask, nr_ids) ||
-   !netif_attr_test_online(j, online_mask, nr_ids))
+   if (!netif_attr_test_mask(j, mask, dev_maps->nr_ids) ||
+   !netif_attr_test_online(

[PATCH net-next v2 06/12] net: assert the rtnl lock is held when calling __netif_set_xps_queue

2021-02-08 Thread Antoine Tenart
Add ASSERT_RTNL at the top of __netif_set_xps_queue and add a comment
above the function stating that the rtnl lock must be held.
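
A minimal caller sketch under the documented locking, assuming dev, mask,
index and err are already set up (at this point in the series the last
argument is still is_rxqs_map):

        rtnl_lock();
        cpus_read_lock();
        err = __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
        cpus_read_unlock();
        rtnl_unlock();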

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1f7df0bd415c..abbb2ae6b3ed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2605,7 +2605,7 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
return new_map;
 }
 
-/* Must be called under cpus_read_lock */
+/* Must be called under rtnl_lock and cpus_read_lock */
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
 {
@@ -2617,6 +2617,8 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
struct xps_map *map, *new_map;
unsigned int nr_ids;
 
+   ASSERT_RTNL();
+
if (dev->num_tc) {
/* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
-- 
2.29.2



[PATCH net-next v2 09/12] net-sysfs: remove the rtnl lock when accessing the xps maps

2021-02-08 Thread Antoine Tenart
Now that nr_ids and num_tc are stored in the xps dev_maps, which are RCU
protected, we no longer need to protect the xps_cpus_show and
xps_rxqs_show functions with the rtnl lock.
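
For illustration only (a simplified fragment, not the exact sysfs code),
the reader side now needs nothing more than RCU, since every bound it
uses comes from the map instance it dereferenced:

        rcu_read_lock();
        dev_maps = rcu_dereference(dev->xps_maps[XPS_CPUS]);
        if (dev_maps) {
                /* nr_ids and num_tc are fixed for this map instance, so
                 * no rtnl lock is needed to read them consistently.
                 */
                nr_ids = dev_maps->nr_ids;
                num_tc = dev_maps->num_tc;
        }
        rcu_read_unlock();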

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 26 --
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c2276b589cfb..6ce5772e799e 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1328,17 +1328,12 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
index = get_netdev_queue_index(queue);
 
-   if (!rtnl_trylock())
-   return restart_syscall();
-
/* If queue belongs to subordinate dev use its map */
dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
 
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
-   }
+   if (tc < 0)
+   return -EINVAL;
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps[XPS_CPUS]);
@@ -1371,16 +1366,12 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 out_no_maps:
rcu_read_unlock();
 
-   rtnl_unlock();
-
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
return len < PAGE_SIZE ? len : -EINVAL;
 
 err_rcu_unlock:
rcu_read_unlock();
-err_rtnl_unlock:
-   rtnl_unlock();
return ret;
 }
 
@@ -1435,14 +1426,9 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
 
index = get_netdev_queue_index(queue);
 
-   if (!rtnl_trylock())
-   return restart_syscall();
-
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
-   }
+   if (tc < 0)
+   return -EINVAL;
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps[XPS_RXQS]);
@@ -1475,8 +1461,6 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
 out_no_maps:
rcu_read_unlock();
 
-   rtnl_unlock();
-
len = bitmap_print_to_pagebuf(false, buf, mask, nr_ids);
bitmap_free(mask);
 
@@ -1484,8 +1468,6 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
 
 err_rcu_unlock:
rcu_read_unlock();
-err_rtnl_unlock:
-   rtnl_unlock();
return ret;
 }
 
-- 
2.29.2



[PATCH net-next v2 03/12] net-sysfs: make xps_cpus_show and xps_rxqs_show consistent

2021-02-08 Thread Antoine Tenart
Make the implementations of xps_cpus_show and xps_rxqs_show converge,
as the two share the same logic but diverged over time. This should not
modify their behaviour but will help future changes and improve
maintenance.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5a39e9b38e5f..3915b1826814 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1317,7 +1317,7 @@ static const struct attribute_group dql_group = {
 static ssize_t xps_cpus_show(struct netdev_queue *queue,
 char *buf)
 {
-   int cpu, len, ret, num_tc = 1, tc = 0;
+   int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
unsigned long *mask;
@@ -1357,23 +1357,26 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_cpus_map);
-   if (dev_maps) {
-   for_each_possible_cpu(cpu) {
-   int i, tci = cpu * num_tc + tc;
-   struct xps_map *map;
-
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (!map)
-   continue;
-
-   for (i = map->len; i--;) {
-   if (map->queues[i] == index) {
-   set_bit(cpu, mask);
-   break;
-   }
+   if (!dev_maps)
+   goto out_no_maps;
+
+   for (j = -1; j = netif_attrmask_next(j, NULL, nr_cpu_ids),
+j < nr_cpu_ids;) {
+   int i, tci = j * num_tc + tc;
+   struct xps_map *map;
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (!map)
+   continue;
+
+   for (i = map->len; i--;) {
+   if (map->queues[i] == index) {
+   set_bit(j, mask);
+   break;
}
}
}
+out_no_maps:
rcu_read_unlock();
 
rtnl_unlock();
-- 
2.29.2



[PATCH net-next v2 02/12] net-sysfs: store the return of get_netdev_queue_index in an unsigned int

2021-02-08 Thread Antoine Tenart
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of
its callers use an unsigned long to store the returned value. Update the
code to be consistent; this should only be cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e052fc5f7e94..5a39e9b38e5f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1320,7 +1320,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1390,7 +1391,7 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
  const char *buf, size_t len)
 {
struct net_device *dev = queue->dev;
-   unsigned long index;
+   unsigned int index;
cpumask_var_t mask;
int err;
 
@@ -1432,7 +1433,8 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
index = get_netdev_queue_index(queue);
 
@@ -1494,7 +1496,8 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, 
const char *buf,
 {
struct net_device *dev = queue->dev;
struct net *net = dev_net(dev);
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
int err;
 
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
-- 
2.29.2



[PATCH net-next v2 01/12] net-sysfs: convert xps_cpus_show to bitmap_zalloc

2021-02-08 Thread Antoine Tenart
Use bitmap_zalloc instead of zalloc_cpumask_var in xps_cpus_show to
align with xps_rxqs_show. This will improve maintenance and allow us to
factorize the two functions. The function should behave the same.
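
For reference, the pattern both show functions converge to is roughly
the following (error paths trimmed, illustrative only):

        unsigned long *mask;
        int len;

        mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
        if (!mask)
                return -ENOMEM;

        /* set_bit() marks the CPUs whose map references this queue */
        set_bit(cpu, mask);

        len = bitmap_print_to_pagebuf(false, buf, mask, nr_cpu_ids);
        bitmap_free(mask);
        return len < PAGE_SIZE ? len : -EINVAL;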

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index daf502c13d6d..e052fc5f7e94 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1320,8 +1320,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   cpumask_var_t mask;
-   unsigned long index;
+   unsigned long *mask, index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1349,7 +1348,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
}
 
-   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
+   if (!mask) {
ret = -ENOMEM;
goto err_rtnl_unlock;
}
@@ -1367,7 +1367,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
for (i = map->len; i--;) {
if (map->queues[i] == index) {
-   cpumask_set_cpu(cpu, mask);
+   set_bit(cpu, mask);
break;
}
}
@@ -1377,8 +1377,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
rtnl_unlock();
 
-   len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask));
-   free_cpumask_var(mask);
+   len = bitmap_print_to_pagebuf(false, buf, mask, nr_cpu_ids);
+   bitmap_free(mask);
return len < PAGE_SIZE ? len : -EINVAL;
 
 err_rtnl_unlock:
-- 
2.29.2



[PATCH net-next v2 00/12] net: xps: improve the xps maps handling

2021-02-08 Thread Antoine Tenart
Hello,

A small series moving the xps cpus/rxqs retrieval logic in net-sysfs was
sent[1] and, while there were no comments asking for modifications in the
patches themselves, we had discussions about other xps-related reworks.
In addition to the patches sent in the previous series[1] (included),
this series has extra patches introducing the modifications we
discussed.

The main change is moving dev->num_tc and dev->nr_ids into the xps maps, to
avoid out-of-bound accesses, as those two fields can be updated after the
maps have been allocated. This allows further reworks to improve the
xps code readability and to stop taking the rtnl lock when
reading the maps in sysfs.

Finally, the maps are moved to an array in net_device, which simplifies
the code a lot.

One future improvement may be to remove the use of xps_map_mutex from
net/core/dev.c, but that may require extra care.

Thanks!
Antoine

[1] https://lore.kernel.org/netdev/20210106180428.722521-1-aten...@kernel.org/

Since v1:
  - Reordered the patches to improve readability and avoid introducing
issues in between patches.
  - Use dev_maps->nr_ids to allocate the mask in xps_queue_show, but
still default to nr_cpu_ids/dev->num_rx_queues when dev_maps hasn't
been allocated yet, for backward compatibility.

Antoine Tenart (12):
  net-sysfs: convert xps_cpus_show to bitmap_zalloc
  net-sysfs: store the return of get_netdev_queue_index in an unsigned
int
  net-sysfs: make xps_cpus_show and xps_rxqs_show consistent
  net: embed num_tc in the xps maps
  net: embed nr_ids in the xps maps
  net: assert the rtnl lock is held when calling __netif_set_xps_queue
  net: remove the xps possible_mask
  net: move the xps maps to an array
  net-sysfs: remove the rtnl lock when accessing the xps maps
  net: add an helper to copy xps maps to the new dev_maps
  net: improve queue removal readability in __netif_set_xps_queue
  net-sysfs: move the xps cpus/rxqs retrieval in a common function

 drivers/net/virtio_net.c  |   2 +-
 include/linux/netdevice.h |  27 -
 net/core/dev.c| 233 +++---
 net/core/net-sysfs.c  | 165 +--
 4 files changed, 194 insertions(+), 233 deletions(-)

-- 
2.29.2



[PATCH net-next 08/11] net: assert the rtnl lock is held when calling __netif_set_xps_queue

2021-01-28 Thread Antoine Tenart
Add ASSERT_RTNL at the top of __netif_set_xps_queue and add a comment
about holding the rtnl lock before the function.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 2a0a777390c6..c639761ddb5e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2623,7 +2623,7 @@ static void xps_copy_dev_maps(struct xps_dev_maps 
*dev_maps,
}
 }
 
-/* Must be called under cpus_read_lock */
+/* Must be called under rtnl_lock and cpus_read_lock */
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
 {
@@ -2635,6 +2635,8 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
struct xps_map *map, *new_map;
unsigned int nr_ids;
 
+   ASSERT_RTNL();
+
if (dev->num_tc) {
/* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
-- 
2.29.2



[PATCH net-next 06/11] net: improve queue removal readability in __netif_set_xps_queue

2021-01-28 Thread Antoine Tenart
Improve the readability of the loop removing a tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7e5b1a4ae4a5..118cc0985ff1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2792,13 +2792,16 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
/* removes tx-queue from unused CPUs/rx-queues */
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
-   for (i = tc, tci = j * dev_maps->num_tc; i--; tci++)
-   active |= remove_xps_queue(dev_maps, tci, index);
-   if (!netif_attr_test_mask(j, mask, nr_ids) ||
-   !netif_attr_test_online(j, online_mask, nr_ids))
-   active |= remove_xps_queue(dev_maps, tci, index);
-   for (i = dev_maps->num_tc - tc, tci++; --i; tci++)
+   tci = j * dev_maps->num_tc;
+
+   for (i = 0; i < dev_maps->num_tc; i++, tci++) {
+   if (i == tc &&
+   netif_attr_test_mask(j, mask, nr_ids) &&
+   netif_attr_test_online(j, online_mask, nr_ids))
+   continue;
+
active |= remove_xps_queue(dev_maps, tci, index);
+   }
}
 
/* free map if not active */
-- 
2.29.2



[PATCH net-next 07/11] net: xps: embed nr_ids in dev_maps

2021-01-28 Thread Antoine Tenart
Embed nr_ids (the number of CPUs for the xps cpus map, and the number of
rxqs for the xps rxqs map) in dev_maps. That helps avoid accessing
out-of-bound memory if those values change after dev_maps was allocated.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  4 
 net/core/dev.c| 26 +++---
 net/core/net-sysfs.c  |  4 ++--
 3 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 481307de6983..f923eb97c446 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -780,6 +780,9 @@ struct xps_map {
 /*
  * This structure holds all XPS maps for device.  Maps are indexed by CPU.
  *
+ * We keep track of the number of cpus/rxqs used when the struct is allocated,
+ * in nr_ids. This will help not accessing out-of-bound memory.
+ *
  * We keep track of the number of traffic classes used when the struct is
  * allocated, in num_tc. This will be used to navigate the maps, to ensure 
we're
  * not crossing its upper bound, as the original dev->num_tc can be updated in
@@ -787,6 +790,7 @@ struct xps_map {
  */
 struct xps_dev_maps {
struct rcu_head rcu;
+   unsigned int nr_ids;
s16 num_tc;
struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 118cc0985ff1..2a0a777390c6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2505,14 +2505,14 @@ static void reset_xps_maps(struct net_device *dev,
 }
 
 static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
-  struct xps_dev_maps *dev_maps, unsigned int nr_ids,
-  u16 offset, u16 count, bool is_rxqs_map)
+  struct xps_dev_maps *dev_maps, u16 offset, u16 count,
+  bool is_rxqs_map)
 {
+   unsigned int nr_ids = dev_maps->nr_ids;
bool active = false;
int i, j;
 
-   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids),
-j < nr_ids;)
+   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
   count);
if (!active)
@@ -2532,7 +2532,6 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 {
const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
-   unsigned int nr_ids;
 
if (!static_key_false(&xps_needed))
return;
@@ -2542,11 +2541,9 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
-   if (dev_maps) {
-   nr_ids = dev->num_rx_queues;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
-  offset, count, true);
-   }
+   if (dev_maps)
+   clean_xps_maps(dev, possible_mask, dev_maps, offset,
+  count, true);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2555,9 +2552,7 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
if (num_possible_cpus() > 1)
possible_mask = cpumask_bits(cpu_possible_mask);
-   nr_ids = nr_cpu_ids;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
-  false);
+   clean_xps_maps(dev, possible_mask, dev_maps, offset, count, false);
 
 out_no_maps:
mutex_unlock(&xps_map_mutex);
@@ -2690,6 +2685,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
return -ENOMEM;
}
 
+   new_dev_maps->nr_ids = nr_ids;
new_dev_maps->num_tc = num_tc;
}
 
@@ -3943,7 +3939,7 @@ static int __get_xps_queue_idx(struct net_device *dev, 
struct sk_buff *skb,
struct xps_map *map;
int queue_index = -1;
 
-   if (tc >= dev_maps->num_tc)
+   if (tc >= dev_maps->num_tc || tci >= dev_maps->nr_ids)
return queue_index;
 
tci *= dev_maps->num_tc;
@@ -3982,7 +3978,7 @@ static int get_xps_queue(struct net_device *dev, struct 
net_device *sb_dev,
if (dev_maps) {
int tci = sk_rx_queue_get(sk);
 
-   if (tci >= 0 && tci < dev->num_rx_queues)
+   if (tci >= 0)
queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
  tci);
}
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index f606f3556ad7..5b7123

[PATCH net-next 09/11] net: remove the xps possible_mask

2021-01-28 Thread Antoine Tenart
Remove the xps possible_mask. It was an optimization, but we can just
loop from 0 to nr_ids now that nr_ids is embedded in the xps dev_maps.
That simplifies the code a bit.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 net/core/dev.c   | 43 ++-
 net/core/net-sysfs.c | 14 +++---
 2 files changed, 17 insertions(+), 40 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c639761ddb5e..d487605d6992 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2504,33 +2504,27 @@ static void reset_xps_maps(struct net_device *dev,
kfree_rcu(dev_maps, rcu);
 }
 
-static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
+static void clean_xps_maps(struct net_device *dev,
   struct xps_dev_maps *dev_maps, u16 offset, u16 count,
   bool is_rxqs_map)
 {
-   unsigned int nr_ids = dev_maps->nr_ids;
bool active = false;
int i, j;
 
-   for (j = -1; j = netif_attrmask_next(j, mask, nr_ids), j < nr_ids;)
-   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
-  count);
+   for (j = 0; j < dev_maps->nr_ids; j++)
+   active |= remove_xps_queue_cpu(dev, dev_maps, j, offset, count);
if (!active)
reset_xps_maps(dev, dev_maps, is_rxqs_map);
 
-   if (!is_rxqs_map) {
-   for (i = offset + (count - 1); count--; i--) {
+   if (!is_rxqs_map)
+   for (i = offset + (count - 1); count--; i--)
netdev_queue_numa_node_write(
-   netdev_get_tx_queue(dev, i),
-   NUMA_NO_NODE);
-   }
-   }
+   netdev_get_tx_queue(dev, i), NUMA_NO_NODE);
 }
 
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
   u16 count)
 {
-   const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
 
if (!static_key_false(&xps_needed))
@@ -2542,17 +2536,14 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
if (dev_maps)
-   clean_xps_maps(dev, possible_mask, dev_maps, offset,
-  count, true);
+   clean_xps_maps(dev, dev_maps, offset, count, true);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
if (!dev_maps)
goto out_no_maps;
 
-   if (num_possible_cpus() > 1)
-   possible_mask = cpumask_bits(cpu_possible_mask);
-   clean_xps_maps(dev, possible_mask, dev_maps, offset, count, false);
+   clean_xps_maps(dev, dev_maps, offset, count, false);
 
 out_no_maps:
mutex_unlock(&xps_map_mutex);
@@ -2627,8 +2618,8 @@ static void xps_copy_dev_maps(struct xps_dev_maps 
*dev_maps,
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
 {
-   const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   const unsigned long *online_mask = NULL;
bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
@@ -2658,10 +2649,8 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
nr_ids = dev->num_rx_queues;
} else {
maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
-   if (num_possible_cpus() > 1) {
+   if (num_possible_cpus() > 1)
online_mask = cpumask_bits(cpu_online_mask);
-   possible_mask = cpumask_bits(cpu_possible_mask);
-   }
dev_maps = xmap_dereference(dev->xps_cpus_map);
nr_ids = nr_cpu_ids;
}
@@ -2711,8 +2700,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
static_key_slow_inc_cpuslocked(&xps_rxqs_needed);
}
 
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < nr_ids;) {
+   for (j = 0; j < nr_ids; j++) {
bool skip_tc = false;
 
tci = j * num_tc + tc;
@@ -2753,8 +2741,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (!dev_maps)
goto out_no_old_maps;
 
-   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
-j < nr_ids;) {
+   for (j = 0; j < nr_ids; j++) {
for (i = num_tc, tci = j * dev_maps->num_tc; i--; tci++) {
map = xmap_dereference(dev_ma

[PATCH net-next 05/11] net: add an helper to copy xps maps to the new dev_maps

2021-01-28 Thread Antoine Tenart
This patch adds a helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic stays the same, with
improved code readability and maintainability.

Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f43281a7367c..7e5b1a4ae4a5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2609,6 +2609,25 @@ static struct xps_map *expand_xps_map(struct xps_map 
*map, int attr_index,
return new_map;
 }
 
+/* Copy xps maps at a given index */
+static void xps_copy_dev_maps(struct xps_dev_maps *dev_maps,
+ struct xps_dev_maps *new_dev_maps, int index,
+ int tc, bool skip_tc)
+{
+   int i, tci = index * dev_maps->num_tc;
+   struct xps_map *map;
+
+   /* copy maps belonging to foreign traffic classes */
+   for (i = 0; i < dev_maps->num_tc; i++, tci++) {
+   if (i == tc && skip_tc)
+   continue;
+
+   /* fill in the new device map from the old device map */
+   map = xmap_dereference(dev_maps->attr_map[tci]);
+   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
+   }
+}
+
 /* Must be called under cpus_read_lock */
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map)
@@ -2696,23 +2715,16 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
-   /* copy maps belonging to foreign traffic classes */
-   for (i = tc, tci = j * num_tc; copy && i--; tci++) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
-   }
+   bool skip_tc = false;
 
-   /* We need to explicitly update tci as prevous loop
-* could break out early if dev_maps is NULL.
-*/
tci = j * num_tc + tc;
-
if (netif_attr_test_mask(j, mask, nr_ids) &&
netif_attr_test_online(j, online_mask, nr_ids)) {
/* add tx-queue to CPU/rx-queue maps */
int pos = 0;
 
+   skip_tc = true;
+
map = xmap_dereference(new_dev_maps->attr_map[tci]);
while ((pos < map->len) && (map->queues[pos] != index))
pos++;
@@ -2727,18 +2739,11 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
numa_node_id = -1;
}
 #endif
-   } else if (copy) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
}
 
-   /* copy maps belonging to foreign traffic classes */
-   for (i = num_tc - tc, tci++; copy && --i; tci++) {
-   /* fill in the new device map from the old device map */
-   map = xmap_dereference(dev_maps->attr_map[tci]);
-   RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
-   }
+   if (copy)
+   xps_copy_dev_maps(dev_maps, new_dev_maps, j, tc,
+ skip_tc);
}
 
if (is_rxqs_map)
-- 
2.29.2



[PATCH net-next 11/11] net: move the xps maps to an array

2021-01-28 Thread Antoine Tenart
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That simplifies the code a lot, removing the need for many
if/else conditionals, as the correct map is available using its
offset in the array.
offset in the array.

This should not modify the xps maps behaviour in any way.

Suggested-by: Alexander Duyck 
Signed-off-by: Antoine Tenart 
---
 drivers/net/virtio_net.c  |  2 +-
 include/linux/netdevice.h | 17 +
 net/core/dev.c| 73 +--
 net/core/net-sysfs.c  | 13 +++
 4 files changed, 48 insertions(+), 57 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e63792549..1c98ef44c6a1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1980,7 +1980,7 @@ static void virtnet_set_affinity(struct virtnet_info *vi)
}
virtqueue_set_affinity(vi->rq[i].vq, mask);
virtqueue_set_affinity(vi->sq[i].vq, mask);
-   __netif_set_xps_queue(vi->dev, cpumask_bits(mask), i, false);
+   __netif_set_xps_queue(vi->dev, cpumask_bits(mask), i, XPS_CPUS);
cpumask_clear(mask);
}
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f923eb97c446..e14fc0f13e5f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -762,6 +762,13 @@ struct rx_queue_attribute {
 const char *buf, size_t len);
 };
 
+/* XPS map type and offset of the xps map within net_device->xps_maps[]. */
+enum xps_map_type {
+   XPS_CPUS = 0,
+   XPS_RXQS,
+   XPS_MAPS_MAX,
+};
+
 #ifdef CONFIG_XPS
 /*
  * This structure holds an XPS map which can be of variable length.  The
@@ -1770,8 +1777,7 @@ enum netdev_priv_flags {
  * @tx_queue_len:  Max frames per queue allowed
  * @tx_global_lock:XXX: need comments on this one
  * @xdp_bulkq: XDP device bulk queue
- * @xps_cpus_map:  all CPUs map for XPS device
- * @xps_rxqs_map:  all RXQs map for XPS device
+ * @xps_maps:  all CPUs/RXQs maps for XPS device
  *
  * @xps_maps:  XXX: need comments on this one
  * @miniq_egress:  clsact qdisc specific data for
@@ -2063,8 +2069,7 @@ struct net_device {
struct xdp_dev_bulk_queue __percpu *xdp_bulkq;
 
 #ifdef CONFIG_XPS
-   struct xps_dev_maps __rcu *xps_cpus_map;
-   struct xps_dev_maps __rcu *xps_rxqs_map;
+   struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];
 #endif
 #ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
@@ -3668,7 +3673,7 @@ static inline void netif_wake_subqueue(struct net_device 
*dev, u16 queue_index)
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
- u16 index, bool is_rxqs_map);
+ u16 index, enum xps_map_type type);
 
 /**
  * netif_attr_test_mask - Test a CPU or Rx queue set in a mask
@@ -3763,7 +3768,7 @@ static inline int netif_set_xps_queue(struct net_device 
*dev,
 
 static inline int __netif_set_xps_queue(struct net_device *dev,
const unsigned long *mask,
-   u16 index, bool is_rxqs_map)
+   u16 index, enum xps_map_type type)
 {
return 0;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index d487605d6992..a51106b8e1ac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2492,31 +2492,34 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 
 static void reset_xps_maps(struct net_device *dev,
   struct xps_dev_maps *dev_maps,
-  bool is_rxqs_map)
+  enum xps_map_type type)
 {
-   if (is_rxqs_map) {
-   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
-   RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
-   } else {
-   RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
-   }
static_key_slow_dec_cpuslocked(&xps_needed);
+   if (type == XPS_RXQS)
+   static_key_slow_dec_cpuslocked(&xps_rxqs_needed);
+
+   RCU_INIT_POINTER(dev->xps_maps[type], NULL);
+
kfree_rcu(dev_maps, rcu);
 }
 
-static void clean_xps_maps(struct net_device *dev,
-  struct xps_dev_maps *dev_maps, u16 offset, u16 count,
-  bool is_rxqs_map)
+static void clean_xps_maps(struct net_device *dev, enum xps_map_type type,
+  u16 offset, u16 count)
 {
+   struct xps_dev_maps *dev_maps;
bool active = false;
int i, j;
 
+   dev_maps = xmap_dereference(dev->xps_maps[type]);
+   if (!dev_maps)
+   return;
+
for (j = 0; j < dev_maps->nr_ids; j++)
 

[PATCH net-next 10/11] net-sysfs: remove the rtnl lock when accessing the xps maps

2021-01-28 Thread Antoine Tenart
Now that nr_ids and num_tc are stored in the xps dev_maps, which are RCU
protected, we no longer need to protect the xps_queue_show
function with the rtnl lock.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 0c564f288460..08c7a494d0e1 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1314,7 +1314,6 @@ static const struct attribute_group dql_group = {
 #endif /* CONFIG_BQL */
 
 #ifdef CONFIG_XPS
-/* Should be called with the rtnl lock held. */
 static int xps_queue_show(struct net_device *dev, unsigned long **mask,
  unsigned int index, bool is_rxqs_map)
 {
@@ -1375,14 +1374,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue, 
char *buf)
if (!mask)
return -ENOMEM;
 
-   if (!rtnl_trylock()) {
-   bitmap_free(mask);
-   return restart_syscall();
-   }
-
ret = xps_queue_show(dev, &mask, index, false);
-   rtnl_unlock();
-
if (ret) {
bitmap_free(mask);
return ret;
@@ -1447,14 +1439,7 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
if (!mask)
return -ENOMEM;
 
-   if (!rtnl_trylock()) {
-   bitmap_free(mask);
-   return restart_syscall();
-   }
-
ret = xps_queue_show(dev, &mask, index, true);
-   rtnl_unlock();
-
if (ret) {
bitmap_free(mask);
return ret;
-- 
2.29.2



[PATCH net-next 03/11] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-01-28 Thread Antoine Tenart
Most of the logic in xps_cpus_show and xps_rxqs_show is shared. Having
it in two different functions does not help maintenance, and small
implementation differences have already crept in. This should not
be the case, so this patch moves their common logic into a new function,
xps_queue_show, to improve maintenance.

While the rtnl lock could be held in the new xps_queue_show, it is still
held in xps_cpus_show and xps_rxqs_show, as this is important
information when looking at those two functions. This does not add
complexity.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 168 ---
 1 file changed, 79 insertions(+), 89 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5a39e9b38e5f..6e6bc05181f6 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1314,77 +1314,98 @@ static const struct attribute_group dql_group = {
 #endif /* CONFIG_BQL */
 
 #ifdef CONFIG_XPS
-static ssize_t xps_cpus_show(struct netdev_queue *queue,
-char *buf)
+/* Should be called with the rtnl lock held. */
+static int xps_queue_show(struct net_device *dev, unsigned long **mask,
+ unsigned int index, bool is_rxqs_map)
 {
-   int cpu, len, ret, num_tc = 1, tc = 0;
-   struct net_device *dev = queue->dev;
+   const unsigned long *possible_mask = NULL;
+   int j, num_tc = 0, tc = 0, ret = 0;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask;
-   unsigned int index;
-
-   if (!netif_is_multiqueue(dev))
-   return -ENOENT;
-
-   index = get_netdev_queue_index(queue);
-
-   if (!rtnl_trylock())
-   return restart_syscall();
+   unsigned int nr_ids;
 
if (dev->num_tc) {
/* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
-   if (num_tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
-   }
+   if (num_tc < 0)
+   return -EINVAL;
 
/* If queue belongs to subordinate dev use its map */
dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
 
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
-   }
+   if (tc < 0)
+   return -EINVAL;
}
 
-   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
-   if (!mask) {
-   ret = -ENOMEM;
-   goto err_rtnl_unlock;
+   rcu_read_lock();
+
+   if (is_rxqs_map) {
+   dev_maps = rcu_dereference(dev->xps_rxqs_map);
+   nr_ids = dev->num_rx_queues;
+   } else {
+   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   nr_ids = nr_cpu_ids;
+   if (num_possible_cpus() > 1)
+   possible_mask = cpumask_bits(cpu_possible_mask);
}
+   if (!dev_maps)
+   goto rcu_unlock;
 
-   rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_cpus_map);
-   if (dev_maps) {
-   for_each_possible_cpu(cpu) {
-   int i, tci = cpu * num_tc + tc;
-   struct xps_map *map;
-
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (!map)
-   continue;
-
-   for (i = map->len; i--;) {
-   if (map->queues[i] == index) {
-   set_bit(cpu, mask);
-   break;
-   }
+   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
+j < nr_ids;) {
+   int i, tci = j * num_tc + tc;
+   struct xps_map *map;
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (!map)
+   continue;
+
+   for (i = map->len; i--;) {
+   if (map->queues[i] == index) {
+   set_bit(j, *mask);
+   break;
}
}
}
+
+rcu_unlock:
rcu_read_unlock();
 
+   return ret;
+}
+
+static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
+{
+   struct net_device *dev = queue->dev;
+   unsigned long *mask;
+   unsigned int index;
+   int len, ret;
+
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+
+   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
+   if (!mask)
+   return -ENOMEM;
+
+   if (!rtnl_trylock()) {
+   bitmap_free(mask);
+   return restart

[PATCH net-next 02/11] net-sysfs: store the return of get_netdev_queue_index in an unsigned int

2021-01-28 Thread Antoine Tenart
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of
its callers use an unsigned long to store the returned value. Update the
code to be consistent; this should only be cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e052fc5f7e94..5a39e9b38e5f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1320,7 +1320,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1390,7 +1391,7 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
  const char *buf, size_t len)
 {
struct net_device *dev = queue->dev;
-   unsigned long index;
+   unsigned int index;
cpumask_var_t mask;
int err;
 
@@ -1432,7 +1433,8 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
index = get_netdev_queue_index(queue);
 
@@ -1494,7 +1496,8 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, 
const char *buf,
 {
struct net_device *dev = queue->dev;
struct net *net = dev_net(dev);
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
int err;
 
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
-- 
2.29.2



[PATCH net-next 04/11] net: embed num_tc in the xps maps

2021-01-28 Thread Antoine Tenart
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to a
mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out-of-bound accesses can occur
(as can easily be seen using KASAN).

This patch fixes this by embedding num_tc into the maps, using the
value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
  sure we won't have out-of-bound accesses.
- Checks can be made before accessing the maps, so we know the values
  retrieved will make sense.

We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps into the new one, only if the number of traffic classes of both
maps matches.

Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  6 
 net/core/dev.c| 63 +--
 net/core/net-sysfs.c  | 25 ++--
 3 files changed, 56 insertions(+), 38 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9e8572533d8e..481307de6983 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -779,9 +779,15 @@ struct xps_map {
 
 /*
  * This structure holds all XPS maps for device.  Maps are indexed by CPU.
+ *
+ * We keep track of the number of traffic classes used when the struct is
+ * allocated, in num_tc. This will be used to navigate the maps, to ensure 
we're
+ * not crossing its upper bound, as the original dev->num_tc can be updated in
+ * the meantime.
  */
 struct xps_dev_maps {
struct rcu_head rcu;
+   s16 num_tc;
struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 6df3f1bcdc68..f43281a7367c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2472,7 +2472,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 struct xps_dev_maps *dev_maps,
 int cpu, u16 offset, u16 count)
 {
-   int num_tc = dev->num_tc ? : 1;
+   int num_tc = dev_maps->num_tc;
bool active = false;
int tci;
 
@@ -2615,10 +2615,10 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 {
const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
+   bool active = false, copy = false;
int i, j, tci, numa_node_id = -2;
int maps_sz, num_tc = 1, tc = 0;
struct xps_map *map, *new_map;
-   bool active = false;
unsigned int nr_ids;
 
if (dev->num_tc) {
@@ -2653,19 +2653,29 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (maps_sz < L1_CACHE_BYTES)
maps_sz = L1_CACHE_BYTES;
 
+   /* The old dev_maps could be larger or smaller than the one we're
+* setting up now, as dev->num_tc could have been updated in between. We
+* could try to be smart, but let's be safe instead and only copy
+* foreign traffic classes if the two map sizes match.
+*/
+   if (dev_maps && dev_maps->num_tc == num_tc)
+   copy = true;
+
/* allocate memory for queue storage */
for (j = -1; j = netif_attrmask_next_and(j, online_mask, mask, nr_ids),
 j < nr_ids;) {
-   if (!new_dev_maps)
-   new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
if (!new_dev_maps) {
-   mutex_unlock(&xps_map_mutex);
-   return -ENOMEM;
+   new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
+   if (!new_dev_maps) {
+   mutex_unlock(&xps_map_mutex);
+   return -ENOMEM;
+   }
+
+   new_dev_maps->num_tc = num_tc;
}
 
tci = j * num_tc + tc;
-   map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
-NULL;
+   map = copy ? xmap_dereference(dev_maps->attr_map[tci]) : NULL;
 
map = expand_xps_map(map, j, index, is_rxqs_map);
if (!map)
@@ -2687,7 +2697,7 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
/* copy maps belonging to foreign traffic classes */
-   for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
+   for (i = tc, tci = j * num_tc; copy && i--; tci++) {
/* fill in the new device map from the old device map */
map = xmap_de

[PATCH net-next 00/11] net: xps: improve the xps maps handling

2021-01-28 Thread Antoine Tenart
Hello,

A small series moving the xps cpus/rxqs retrieval logic in net-sysfs was
sent[1] and, while there were no comments asking for modifications to the
patches themselves, we had discussions about other xps-related reworks.
In addition to the patches sent in the previous series[1] (included),
this series has extra patches introducing the modifications we
discussed.

The main change is moving dev->num_tc and the number of ids (nr_ids)
into the xps maps, to avoid out-of-bounds accesses as those two values
can be updated after the maps have been allocated. This allows further
reworks, to improve the xps code readability and to stop taking the rtnl
lock when reading the maps in sysfs.

Finally, the maps are moved to an array in net_device, which simplifies
the code a lot.
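
As a rough sketch of where the series ends up (an editorial
illustration; the exact field, enum and array names may differ in the
individual patches), the maps carry their own bounds and are indexed by
type:

  /* Sketch only; see the individual patches for the real definitions. */
  enum xps_map_type {
          XPS_CPUS = 0,
          XPS_RXQS,
          XPS_MAPS_MAX,
  };

  struct xps_dev_maps {
          struct rcu_head rcu;
          unsigned int nr_ids;    /* nr_cpu_ids or num_rx_queues at alloc time */
          s16 num_tc;             /* dev->num_tc at alloc time */
          struct xps_map __rcu *attr_map[]; /* Either CPUs map or RXQs map */
  };

  /* In struct net_device, the two xps map pointers become one array: */
  struct xps_dev_maps __rcu *xps_maps[XPS_MAPS_MAX];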

One future improvement may be to remove the use of xps_map_mutex from
net/core/dev.c, but that may require extra care.

Thanks!
Antoine

[1] https://lore.kernel.org/netdev/20210106180428.722521-1-aten...@kernel.org/

Antoine Tenart (11):
  net-sysfs: convert xps_cpus_show to bitmap_zalloc
  net-sysfs: store the return of get_netdev_queue_index in an unsigned
int
  net-sysfs: move the xps cpus/rxqs retrieval in a common function
  net: embed num_tc in the xps maps
  net: add an helper to copy xps maps to the new dev_maps
  net: improve queue removal readability in __netif_set_xps_queue
  net: xps: embed nr_ids in dev_maps
  net: assert the rtnl lock is held when calling __netif_set_xps_queue
  net: remove the xps possible_mask
  net-sysfs: remove the rtnl lock when accessing the xps maps
  net: move the xps maps to an array

 drivers/net/virtio_net.c  |   2 +-
 include/linux/netdevice.h |  27 -
 net/core/dev.c| 233 +++---
 net/core/net-sysfs.c  | 168 +++
 4 files changed, 202 insertions(+), 228 deletions(-)

-- 
2.29.2



[PATCH net-next 01/11] net-sysfs: convert xps_cpus_show to bitmap_zalloc

2021-01-28 Thread Antoine Tenart
Use bitmap_zalloc instead of zalloc_cpumask_var in xps_cpus_show to
align with xps_rxqs_show. This will improve maintenance and allow us to
factorize the two functions. The function should behave the same.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index daf502c13d6d..e052fc5f7e94 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1320,8 +1320,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   cpumask_var_t mask;
-   unsigned long index;
+   unsigned long *mask, index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1349,7 +1348,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
}
 
-   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
+   if (!mask) {
ret = -ENOMEM;
goto err_rtnl_unlock;
}
@@ -1367,7 +1367,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
for (i = map->len; i--;) {
if (map->queues[i] == index) {
-   cpumask_set_cpu(cpu, mask);
+   set_bit(cpu, mask);
break;
}
}
@@ -1377,8 +1377,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
rtnl_unlock();
 
-   len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask));
-   free_cpumask_var(mask);
+   len = bitmap_print_to_pagebuf(false, buf, mask, nr_cpu_ids);
+   bitmap_free(mask);
return len < PAGE_SIZE ? len : -EINVAL;
 
 err_rtnl_unlock:
-- 
2.29.2



Re: [PATCH net 3/3] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-01-08 Thread Antoine Tenart
Quoting Alexander Duyck (2021-01-08 23:04:57)
> On Fri, Jan 8, 2021 at 10:58 AM Antoine Tenart  wrote:
> >
> > Quoting Alexander Duyck (2021-01-08 17:33:01)
> > > On Fri, Jan 8, 2021 at 1:07 AM Antoine Tenart  wrote:
> > > >
> > > > Quoting Alexander Duyck (2021-01-07 17:38:35)
> > > > > On Thu, Jan 7, 2021 at 12:54 AM Antoine Tenart  
> > > > > wrote:
> > > > > >
> > > > > > Quoting Alexander Duyck (2021-01-06 20:54:11)
> > > > > > > On Wed, Jan 6, 2021 at 10:04 AM Antoine Tenart 
> > > > > > >  wrote:
> > > > > > > > if (dev->num_tc) {
> > > > > > > > /* Do not allow XPS on subordinate device 
> > > > > > > > directly */
> > > > > > > > num_tc = dev->num_tc;
> > > > > > > > -   if (num_tc < 0) {
> > > > > > > > -   ret = -EINVAL;
> > > > > > > > -   goto err_rtnl_unlock;
> > > > > > > > -   }
> > > > > > > > +   if (num_tc < 0)
> > > > > > > > +   return -EINVAL;
> > > > > > > >
> > > > > > > > /* If queue belongs to subordinate dev use its 
> > > > > > > > map */
> > > > > > > > dev = netdev_get_tx_queue(dev, index)->sb_dev ? 
> > > > > > > > : dev;
> > > > > > > >
> > > > > > > > tc = netdev_txq_to_tc(dev, index);
> > > > > > > > -   if (tc < 0) {
> > > > > > > > -   ret = -EINVAL;
> > > > > > > > -   goto err_rtnl_unlock;
> > > > > > > > -   }
> > > > > > > > +   if (tc < 0)
> > > > > > > > +   return -EINVAL;
> > > > > > > > }
> > > > > > > >
> > > > > > >
> > > > > > > So if we store the num_tc and nr_ids in the dev_maps structure 
> > > > > > > then we
> > > > > > > could simplify this a bit by pulling the num_tc info out of the
> > > > > > > dev_map and only asking the Tx queue for the tc in that case and
> > > > > > > validating it against (tc <0 || num_tc <= tc) and returning an 
> > > > > > > error
> > > > > > > if either are true.
> > > > > > >
> > > > > > > This would also allow us to address the fact that the rxqs feature
> > > > > > > doesn't support the subordinate devices as you could pull out the 
> > > > > > > bit
> > > > > > > above related to the sb_dev and instead call that prior to calling
> > > > > > > xps_queue_show so that you are operating on the correct device 
> > > > > > > map.
> > > > >
> > > > > It probably would be necessary to pass dev and index if we pull the
> > > > > netdev_get_tx_queue()->sb_dev bit out and performed that before we
> > > > > called the xps_queue_show function. Specifically as the subordinate
> > > > > device wouldn't match up with the queue device so we would be better
> > > > > off pulling it out first.
> > > >
> > > > While I agree moving the netdev_get_tx_queue()->sb_dev bit out of
> > > > xps_queue_show seems like a good idea for consistency, I'm not sure
> > > > it'll work: dev->num_tc is not only used to retrieve the number of tc
> > > > but also as a condition on not being 0. We have things like:
> > > >
> > > >   // Always done with the original dev.
> > > >   if (dev->num_tc) {
> > > >
> > > >   [...]
> > > >
> > > >   // Can be a subordinate dev.
> > > >   tc = netdev_txq_to_tc(dev, index);
> > > >   }
> > > >
> > > > And after moving num_tc in the map, we'll have checks like:
> > > >
> > > >   if (dev_maps->num_tc != dev->num_tc)
> > > >   return -EINVAL;
> > >
> > > We shouldn't. That 

Re: [PATCH net 3/3] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-01-08 Thread Antoine Tenart
Quoting Alexander Duyck (2021-01-08 17:33:01)
> On Fri, Jan 8, 2021 at 1:07 AM Antoine Tenart  wrote:
> >
> > Quoting Alexander Duyck (2021-01-07 17:38:35)
> > > On Thu, Jan 7, 2021 at 12:54 AM Antoine Tenart  wrote:
> > > >
> > > > Quoting Alexander Duyck (2021-01-06 20:54:11)
> > > > > On Wed, Jan 6, 2021 at 10:04 AM Antoine Tenart  
> > > > > wrote:
> > > >
> > > > That would require to hold rcu_read_lock in the caller and I'd like to
> > > > keep it in that function.
> > >
> > > Actually you could probably make it work if you were to pass a pointer
> > > to the RCU pointer.
> >
> > That should work but IMHO that could be easily breakable by future
> > changes as it's a bit tricky.
> >
> > > > > > if (dev->num_tc) {
> > > > > > /* Do not allow XPS on subordinate device directly 
> > > > > > */
> > > > > > num_tc = dev->num_tc;
> > > > > > -   if (num_tc < 0) {
> > > > > > -   ret = -EINVAL;
> > > > > > -   goto err_rtnl_unlock;
> > > > > > -   }
> > > > > > +   if (num_tc < 0)
> > > > > > +   return -EINVAL;
> > > > > >
> > > > > > /* If queue belongs to subordinate dev use its map 
> > > > > > */
> > > > > > dev = netdev_get_tx_queue(dev, index)->sb_dev ? : 
> > > > > > dev;
> > > > > >
> > > > > > tc = netdev_txq_to_tc(dev, index);
> > > > > > -   if (tc < 0) {
> > > > > > -   ret = -EINVAL;
> > > > > > -   goto err_rtnl_unlock;
> > > > > > -   }
> > > > > > +   if (tc < 0)
> > > > > > +   return -EINVAL;
> > > > > > }
> > > > > >
> > > > >
> > > > > So if we store the num_tc and nr_ids in the dev_maps structure then we
> > > > > could simplify this a bit by pulling the num_tc info out of the
> > > > > dev_map and only asking the Tx queue for the tc in that case and
> > > > > validating it against (tc <0 || num_tc <= tc) and returning an error
> > > > > if either are true.
> > > > >
> > > > > This would also allow us to address the fact that the rxqs feature
> > > > > doesn't support the subordinate devices as you could pull out the bit
> > > > > above related to the sb_dev and instead call that prior to calling
> > > > > xps_queue_show so that you are operating on the correct device map.
> > >
> > > It probably would be necessary to pass dev and index if we pull the
> > > netdev_get_tx_queue()->sb_dev bit out and performed that before we
> > > called the xps_queue_show function. Specifically as the subordinate
> > > device wouldn't match up with the queue device so we would be better
> > > off pulling it out first.
> >
> > While I agree moving the netdev_get_tx_queue()->sb_dev bit out of
> > xps_queue_show seems like a good idea for consistency, I'm not sure
> > it'll work: dev->num_tc is not only used to retrieve the number of tc
> > but also as a condition on not being 0. We have things like:
> >
> >   // Always done with the original dev.
> >   if (dev->num_tc) {
> >
> >   [...]
> >
> >   // Can be a subordinate dev.
> >   tc = netdev_txq_to_tc(dev, index);
> >   }
> >
> > And after moving num_tc in the map, we'll have checks like:
> >
> >   if (dev_maps->num_tc != dev->num_tc)
> >   return -EINVAL;
> 
> We shouldn't. That defeats the whole point and you will never be able
> to rely on the num_tc in the device to remain constant. If we are
> moving the value to an RCU accessible attribute we should just be
> using that value. We can only use that check if we are in an rtnl_lock
> anyway and we won't need that if we are just displaying the value.
> 
> The checks should only be used to verify the tc of the queue is within
> the bounds of the num_tc of the xps_map.

Right. So that would mean we have to choose between:

- Removing the rtnl lock but with the

Re: [PATCH net 3/3] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-01-08 Thread Antoine Tenart
Quoting Alexander Duyck (2021-01-07 17:38:35)
> On Thu, Jan 7, 2021 at 12:54 AM Antoine Tenart  wrote:
> >
> > Quoting Alexander Duyck (2021-01-06 20:54:11)
> > > On Wed, Jan 6, 2021 at 10:04 AM Antoine Tenart  wrote:
> >
> > That would require to hold rcu_read_lock in the caller and I'd like to
> > keep it in that function.
> 
> Actually you could probably make it work if you were to pass a pointer
> to the RCU pointer.

That should work, but IMHO it could easily be broken by future changes
as it's a bit tricky.

> > > > if (dev->num_tc) {
> > > > /* Do not allow XPS on subordinate device directly */
> > > > num_tc = dev->num_tc;
> > > > -   if (num_tc < 0) {
> > > > -   ret = -EINVAL;
> > > > -   goto err_rtnl_unlock;
> > > > -   }
> > > > +   if (num_tc < 0)
> > > > +   return -EINVAL;
> > > >
> > > > /* If queue belongs to subordinate dev use its map */
> > > > dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
> > > >
> > > > tc = netdev_txq_to_tc(dev, index);
> > > > -   if (tc < 0) {
> > > > -   ret = -EINVAL;
> > > > -   goto err_rtnl_unlock;
> > > > -   }
> > > > +   if (tc < 0)
> > > > +   return -EINVAL;
> > > > }
> > > >
> > >
> > > So if we store the num_tc and nr_ids in the dev_maps structure then we
> > > could simplify this a bit by pulling the num_tc info out of the
> > > dev_map and only asking the Tx queue for the tc in that case and
> > > validating it against (tc <0 || num_tc <= tc) and returning an error
> > > if either are true.
> > >
> > > This would also allow us to address the fact that the rxqs feature
> > > doesn't support the subordinate devices as you could pull out the bit
> > > above related to the sb_dev and instead call that prior to calling
> > > xps_queue_show so that you are operating on the correct device map.
> 
> It probably would be necessary to pass dev and index if we pull the
> netdev_get_tx_queue()->sb_dev bit out and performed that before we
> called the xps_queue_show function. Specifically as the subordinate
> device wouldn't match up with the queue device so we would be better
> off pulling it out first.

While I agree moving the netdev_get_tx_queue()->sb_dev bit out of
xps_queue_show seems like a good idea for consistency, I'm not sure
it'll work: dev->num_tc is not only used to retrieve the number of tcs
but also as a condition on it not being 0. We have things like:

  // Always done with the original dev.
  if (dev->num_tc) {

  [...]

  // Can be a subordinate dev.
  tc = netdev_txq_to_tc(dev, index);
  }

And after moving num_tc in the map, we'll have checks like:

  if (dev_maps->num_tc != dev->num_tc)
  return -EINVAL;

I don't think the subordinate dev holds the same num_tc value as dev.
What's your take on this?

> > > > -   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
> > > > -   if (!mask) {
> > > > -   ret = -ENOMEM;
> > > > -   goto err_rtnl_unlock;
> > > > +   rcu_read_lock();
> > > > +
> > > > +   if (is_rxqs_map) {
> > > > +   dev_maps = rcu_dereference(dev->xps_rxqs_map);
> > > > +   nr_ids = dev->num_rx_queues;
> > > > +   } else {
> > > > +   dev_maps = rcu_dereference(dev->xps_cpus_map);
> > > > +   nr_ids = nr_cpu_ids;
> > > > +   if (num_possible_cpus() > 1)
> > > > +   possible_mask = cpumask_bits(cpu_possible_mask);
> > > > }
> > >
> 
> I don't think we need the possible_mask check. That is mostly just an
> optimization that was making use of an existing "for_each" loop macro.
> If we are going to go through 0 through nr_ids then there is no need
> for the possible_mask as we can just brute force our way through and
> will not find CPU that aren't there since we couldn't have added them
> to the map anyway.

I'll remove it then. __netif_set_xps_queue could also benefit from it.
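
For illustration (an editorial sketch, not code from a posted patch),
dropping possible_mask means simply walking every id up to nr_ids,
reusing the variable names from xps_queue_show above; ids that were
never added to a map just have no entry:

  /* Sketch of the simplification discussed: brute-force 0..nr_ids. */
  for (j = 0; j < nr_ids; j++) {
          int i, tci = j * num_tc + tc;
          struct xps_map *map;

          map = rcu_dereference(dev_maps->attr_map[tci]);
          if (!map)
                  continue;

          for (i = map->len; i--;) {
                  if (map->queues[i] == index) {
                          set_bit(j, *mask);
                          break;
                  }
          }
  }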

> > > I think Jakub had mentioned earlier the idea of

Re: [PATCH net 3/3] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-01-07 Thread Antoine Tenart
Quoting Alexander Duyck (2021-01-06 20:54:11)
> On Wed, Jan 6, 2021 at 10:04 AM Antoine Tenart  wrote:
> > +/* Should be called with the rtnl lock held. */
> > +static int xps_queue_show(struct net_device *dev, unsigned long **mask,
> > + unsigned int index, bool is_rxqs_map)
> 
> Why pass dev and index instead of just the queue which already
> contains both?

Right, I can do that.

> I think it would make more sense to just stick to passing the queue
> through along with a pointer to the xps_dev_maps value that we need to
> read.

That would require holding rcu_read_lock in the caller, and I'd like to
keep it in that function.

> > if (dev->num_tc) {
> > /* Do not allow XPS on subordinate device directly */
> > num_tc = dev->num_tc;
> > -   if (num_tc < 0) {
> > -   ret = -EINVAL;
> > -   goto err_rtnl_unlock;
> > -   }
> > +   if (num_tc < 0)
> > +   return -EINVAL;
> >
> > /* If queue belongs to subordinate dev use its map */
> > dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
> >
> > tc = netdev_txq_to_tc(dev, index);
> > -   if (tc < 0) {
> > -   ret = -EINVAL;
> > -   goto err_rtnl_unlock;
> > -   }
> > +   if (tc < 0)
> > +   return -EINVAL;
> > }
> >
> 
> So if we store the num_tc and nr_ids in the dev_maps structure then we
> could simplify this a bit by pulling the num_tc info out of the
> dev_map and only asking the Tx queue for the tc in that case and
> validating it against (tc <0 || num_tc <= tc) and returning an error
> if either are true.
> 
> This would also allow us to address the fact that the rxqs feature
> doesn't support the subordinate devices as you could pull out the bit
> above related to the sb_dev and instead call that prior to calling
> xps_queue_show so that you are operating on the correct device map.
> 
> > -   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
> > -   if (!mask) {
> > -   ret = -ENOMEM;
> > -   goto err_rtnl_unlock;
> > +   rcu_read_lock();
> > +
> > +   if (is_rxqs_map) {
> > +   dev_maps = rcu_dereference(dev->xps_rxqs_map);
> > +   nr_ids = dev->num_rx_queues;
> > +   } else {
> > +   dev_maps = rcu_dereference(dev->xps_cpus_map);
> > +   nr_ids = nr_cpu_ids;
> > +   if (num_possible_cpus() > 1)
> > +   possible_mask = cpumask_bits(cpu_possible_mask);
> > }
> 
> I think Jakub had mentioned earlier the idea of possibly moving some
> fields into the xps_cpus_map and xps_rxqs_map in order to reduce the
> complexity of this so that certain values would be protected by the
> RCU lock.
> 
> This might be a good time to look at encoding things like the number
> of IDs and the number of TCs there in order to avoid a bunch of this
> duplication. Then you could just pass a pointer to the map you want to
> display and the code should be able to just dump the values.:

100% agree with all the above. That would also prevent out-of-bounds
accesses when dev->num_tc is increased after dev_maps is allocated. I do
have a series ready to be sent, storing num_tc in the maps and reworking
the code to use it instead of dev->num_tc. The series also adds checks
to ensure the map is valid when we access it (such as making sure
dev->num_tc == map->num_tc). However, I did not move nr_ids into the map
yet; I'll look into it.

The idea is to send it as a follow-up series, as this one is only moving
code around to improve maintenance and readability. Even if all the
patches were in the same series, this one would be a prerequisite.

Thanks!
Antoine


[PATCH net 1/3] net-sysfs: convert xps_cpus_show to bitmap_zalloc

2021-01-06 Thread Antoine Tenart
Use bitmap_zalloc instead of zalloc_cpumask_var in xps_cpus_show to
align with xps_rxqs_show. This will improve maintenance and allow us to
factorize the two functions. The function should behave the same.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index daf502c13d6d..e052fc5f7e94 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1320,8 +1320,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   cpumask_var_t mask;
-   unsigned long index;
+   unsigned long *mask, index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1349,7 +1348,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
}
 
-   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
+   if (!mask) {
ret = -ENOMEM;
goto err_rtnl_unlock;
}
@@ -1367,7 +1367,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
for (i = map->len; i--;) {
if (map->queues[i] == index) {
-   cpumask_set_cpu(cpu, mask);
+   set_bit(cpu, mask);
break;
}
}
@@ -1377,8 +1377,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
rtnl_unlock();
 
-   len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask));
-   free_cpumask_var(mask);
+   len = bitmap_print_to_pagebuf(false, buf, mask, nr_cpu_ids);
+   bitmap_free(mask);
return len < PAGE_SIZE ? len : -EINVAL;
 
 err_rtnl_unlock:
-- 
2.29.2



[PATCH net 0/3] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-01-06 Thread Antoine Tenart
Hello,

In net-sysfs, the xps_cpus_show and xps_rxqs_show functions share the
same logic. To improve readability and maintenance, as discussed
here[1], this series moves their common logic to a new function.

Patches 1/3 and 2/3 are prerequisites for the factorization to happen,
so that patch 3/3 looks better and is easier to review.

Thanks!
Antoine

[1] 
https://lore.kernel.org/netdev/160875219353.1783433.8066935261216141538@kwain.local/

Antoine Tenart (3):
  net-sysfs: convert xps_cpus_show to bitmap_zalloc
  net-sysfs: store the return of get_netdev_queue_index in an unsigned
int
  net-sysfs: move the xps cpus/rxqs retrieval in a common function

 net/core/net-sysfs.c | 179 +--
 1 file changed, 86 insertions(+), 93 deletions(-)

-- 
2.29.2



[PATCH net 2/3] net-sysfs: store the return of get_netdev_queue_index in an unsigned int

2021-01-06 Thread Antoine Tenart
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of
its callers use an unsigned long to store the returned value. Update the
code to be consistent; this should only be cosmetic.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e052fc5f7e94..5a39e9b38e5f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1320,7 +1320,8 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
@@ -1390,7 +1391,7 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
  const char *buf, size_t len)
 {
struct net_device *dev = queue->dev;
-   unsigned long index;
+   unsigned int index;
cpumask_var_t mask;
int err;
 
@@ -1432,7 +1433,8 @@ static ssize_t xps_rxqs_show(struct netdev_queue *queue, 
char *buf)
int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
 
index = get_netdev_queue_index(queue);
 
@@ -1494,7 +1496,8 @@ static ssize_t xps_rxqs_store(struct netdev_queue *queue, 
const char *buf,
 {
struct net_device *dev = queue->dev;
struct net *net = dev_net(dev);
-   unsigned long *mask, index;
+   unsigned long *mask;
+   unsigned int index;
int err;
 
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
-- 
2.29.2



[PATCH net 3/3] net-sysfs: move the xps cpus/rxqs retrieval in a common function

2021-01-06 Thread Antoine Tenart
Most of the logic in xps_cpus_show and xps_rxqs_show is the same. Having
it in two different functions does not help maintenance, and small
implementation differences have already crept in. This patch moves their
common logic into a new function, xps_queue_show, to improve
maintainability.

While the rtnl lock could be held in the new xps_queue_show, it is still
held in xps_cpus_show and xps_rxqs_show as this is important
information when looking at those two functions. This does not add
complexity.

Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 168 ---
 1 file changed, 79 insertions(+), 89 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 5a39e9b38e5f..6e6bc05181f6 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1314,77 +1314,98 @@ static const struct attribute_group dql_group = {
 #endif /* CONFIG_BQL */
 
 #ifdef CONFIG_XPS
-static ssize_t xps_cpus_show(struct netdev_queue *queue,
-char *buf)
+/* Should be called with the rtnl lock held. */
+static int xps_queue_show(struct net_device *dev, unsigned long **mask,
+ unsigned int index, bool is_rxqs_map)
 {
-   int cpu, len, ret, num_tc = 1, tc = 0;
-   struct net_device *dev = queue->dev;
+   const unsigned long *possible_mask = NULL;
+   int j, num_tc = 0, tc = 0, ret = 0;
struct xps_dev_maps *dev_maps;
-   unsigned long *mask;
-   unsigned int index;
-
-   if (!netif_is_multiqueue(dev))
-   return -ENOENT;
-
-   index = get_netdev_queue_index(queue);
-
-   if (!rtnl_trylock())
-   return restart_syscall();
+   unsigned int nr_ids;
 
if (dev->num_tc) {
/* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
-   if (num_tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
-   }
+   if (num_tc < 0)
+   return -EINVAL;
 
/* If queue belongs to subordinate dev use its map */
dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
 
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0) {
-   ret = -EINVAL;
-   goto err_rtnl_unlock;
-   }
+   if (tc < 0)
+   return -EINVAL;
}
 
-   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
-   if (!mask) {
-   ret = -ENOMEM;
-   goto err_rtnl_unlock;
+   rcu_read_lock();
+
+   if (is_rxqs_map) {
+   dev_maps = rcu_dereference(dev->xps_rxqs_map);
+   nr_ids = dev->num_rx_queues;
+   } else {
+   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   nr_ids = nr_cpu_ids;
+   if (num_possible_cpus() > 1)
+   possible_mask = cpumask_bits(cpu_possible_mask);
}
+   if (!dev_maps)
+   goto rcu_unlock;
 
-   rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_cpus_map);
-   if (dev_maps) {
-   for_each_possible_cpu(cpu) {
-   int i, tci = cpu * num_tc + tc;
-   struct xps_map *map;
-
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (!map)
-   continue;
-
-   for (i = map->len; i--;) {
-   if (map->queues[i] == index) {
-   set_bit(cpu, mask);
-   break;
-   }
+   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
+j < nr_ids;) {
+   int i, tci = j * num_tc + tc;
+   struct xps_map *map;
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (!map)
+   continue;
+
+   for (i = map->len; i--;) {
+   if (map->queues[i] == index) {
+   set_bit(j, *mask);
+   break;
}
}
}
+
+rcu_unlock:
rcu_read_unlock();
 
+   return ret;
+}
+
+static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
+{
+   struct net_device *dev = queue->dev;
+   unsigned long *mask;
+   unsigned int index;
+   int len, ret;
+
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+
+   mask = bitmap_zalloc(nr_cpu_ids, GFP_KERNEL);
+   if (!mask)
+   return -ENOMEM;
+
+   if (!rtnl_trylock()) {
+   bitmap_free(mask);
+   return restart

[PATCH net v3 1/4] net-sysfs: take the rtnl lock when storing xps_cpus

2020-12-23 Thread Antoine Tenart
Two race conditions can be triggered when storing xps cpus, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

   - netif_set_xps_queue uses dev->num_tc as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->num_tc is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   - netdev_set_num_tc sets dev->num_tc.

   If new_dev_maps is allocated using dev->num_tc and dev->num_tc is then
   set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could touch memory outside of
   new_dev_maps, triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues;
        dev->num_tc isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the map with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->num_tc.

   2.4. Later accesses to the map lead to out-of-bounds accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and to write to
xps_cpus in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_cpus_store.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 999b70c59761..7cc15dec1717 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1396,7 +1396,13 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
return err;
}
 
+   if (!rtnl_trylock()) {
+   free_cpumask_var(mask);
+   return restart_syscall();
+   }
+
err = netif_set_xps_queue(dev, mask, index);
+   rtnl_unlock();
 
free_cpumask_var(mask);
 
-- 
2.29.2



[PATCH net v3 3/4] net-sysfs: take the rtnl lock when storing xps_rxqs

2020-12-23 Thread Antoine Tenart
Two race conditions can be triggered when storing xps rxqs, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

   - netif_set_xps_queue uses dev->num_tc as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->num_tc is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   - netdev_set_num_tc sets dev->num_tc.

   If new_dev_maps is allocated using dev->num_tc and dev->num_tc is then
   set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could touch memory outside of
   new_dev_maps, triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues;
        dev->num_tc isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the map with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->num_tc.

   2.4. Later accesses to the map lead to out-of-bounds accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and to write to
xps_rxqs in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_rxqs_store.

Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx 
queue")
Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 65886bfbf822..62ca2f2c0ee6 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1499,10 +1499,17 @@ static ssize_t xps_rxqs_store(struct netdev_queue 
*queue, const char *buf,
return err;
}
 
+   if (!rtnl_trylock()) {
+   bitmap_free(mask);
+   return restart_syscall();
+   }
+
cpus_read_lock();
err = __netif_set_xps_queue(dev, mask, index, true);
cpus_read_unlock();
 
+   rtnl_unlock();
+
bitmap_free(mask);
return err ? : len;
 }
-- 
2.29.2



[PATCH net v3 4/4] net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc

2020-12-23 Thread Antoine Tenart
Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx 
queue")
Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 62ca2f2c0ee6..daf502c13d6d 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1429,22 +1429,29 @@ static struct netdev_queue_attribute xps_cpus_attribute 
__ro_after_init
 
 static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
 {
+   int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
unsigned long *mask, index;
-   int j, len, num_tc = 1, tc = 0;
 
index = get_netdev_queue_index(queue);
 
+   if (!rtnl_trylock())
+   return restart_syscall();
+
if (dev->num_tc) {
num_tc = dev->num_tc;
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0)
-   return -EINVAL;
+   if (tc < 0) {
+   ret = -EINVAL;
+   goto err_rtnl_unlock;
+   }
}
mask = bitmap_zalloc(dev->num_rx_queues, GFP_KERNEL);
-   if (!mask)
-   return -ENOMEM;
+   if (!mask) {
+   ret = -ENOMEM;
+   goto err_rtnl_unlock;
+   }
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_rxqs_map);
@@ -1470,10 +1477,16 @@ static ssize_t xps_rxqs_show(struct netdev_queue 
*queue, char *buf)
 out_no_maps:
rcu_read_unlock();
 
+   rtnl_unlock();
+
len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
bitmap_free(mask);
 
return len < PAGE_SIZE ? len : -EINVAL;
+
+err_rtnl_unlock:
+   rtnl_unlock();
+   return ret;
 }
 
 static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
-- 
2.29.2



[PATCH net v3 2/4] net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc

2020-12-23 Thread Antoine Tenart
Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 7cc15dec1717..65886bfbf822 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1317,8 +1317,8 @@ static const struct attribute_group dql_group = {
 static ssize_t xps_cpus_show(struct netdev_queue *queue,
 char *buf)
 {
+   int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
-   int cpu, len, num_tc = 1, tc = 0;
struct xps_dev_maps *dev_maps;
cpumask_var_t mask;
unsigned long index;
@@ -1328,22 +1328,31 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
index = get_netdev_queue_index(queue);
 
+   if (!rtnl_trylock())
+   return restart_syscall();
+
if (dev->num_tc) {
/* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
-   if (num_tc < 0)
-   return -EINVAL;
+   if (num_tc < 0) {
+   ret = -EINVAL;
+   goto err_rtnl_unlock;
+   }
 
/* If queue belongs to subordinate dev use its map */
dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
 
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0)
-   return -EINVAL;
+   if (tc < 0) {
+   ret = -EINVAL;
+   goto err_rtnl_unlock;
+   }
}
 
-   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
-   return -ENOMEM;
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   ret = -ENOMEM;
+   goto err_rtnl_unlock;
+   }
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_cpus_map);
@@ -1366,9 +1375,15 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
rcu_read_unlock();
 
+   rtnl_unlock();
+
len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask));
free_cpumask_var(mask);
return len < PAGE_SIZE ? len : -EINVAL;
+
+err_rtnl_unlock:
+   rtnl_unlock();
+   return ret;
 }
 
 static ssize_t xps_cpus_store(struct netdev_queue *queue,
-- 
2.29.2



[PATCH net v3 0/4] net-sysfs: fix race conditions in the xps code

2020-12-23 Thread Antoine Tenart
Hello all,

This series fixes race conditions in the xps code, where out-of-bounds
accesses can occur when dev->num_tc is updated, triggering oops. The
root cause is linked to locking issues. An explanation is given in each
of the commit logs.

We had a discussion on the v1 of this series about using the xps_map
mutex instead of the rtnl lock. While that seemed a better compromise,
v2 showed the added complexity wasn't best for fixes. So we decided to
go back to v1 and use the rtnl lock.

Because of this, the only differences between v1 and v3 are improvements
in the commit messages.

Thanks!
Antoine

Antoine Tenart (4):
  net-sysfs: take the rtnl lock when storing xps_cpus
  net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
  net-sysfs: take the rtnl lock when storing xps_rxqs
  net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc

 net/core/net-sysfs.c | 65 
 1 file changed, 53 insertions(+), 12 deletions(-)

-- 
2.29.2



Re: [PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num

2020-12-23 Thread Antoine Tenart
Quoting Jakub Kicinski (2020-12-23 21:43:15)
> On Wed, 23 Dec 2020 21:35:15 +0100 Antoine Tenart wrote:
> > > > - For net-next, to resend patches 2 and 3 from v2 (they'll have to be
> > > >   slightly reworked, to take into account the review from Alexander and
> > > >   the rtnl lock). The patches can be sent once the ones for net land in
> > > >   net-next.  
> > > 
> > > If the direction is to remove xps_map_mutex, why would we need patch 2?
> > > 🤔  
> > 
> > Only the patches for net are needed to fix the race conditions.
> > 
> > In addition to use the xps_map mutex, patches 2 and 3 from v2 factorize
> > the code into a single function, as xps_cpus_show and xps_rxqs_show
> > share the same logic. That would improve maintainability, but isn't
> > mandatory.
> > 
> > Sorry, it was not very clear...
> 
> I like the cleanup, sorry I'm net very clear either.
> 
> My understanding was that patch 2 was only needed to have access to the
> XPS lock, if we don't plan to use that lock netif_show_xps_queue() can
> stay in the sysfs file, right? I'm all for the cleanup and code reuse
> for rxqs, I'm just making sure I'm not missing anything. I wasn't
> seeing a reason to move netif_show_xps_queue(), that's all.

You understood correctly: the only reason to move this code out of sysfs
was to access the xps_map lock. Without that need, the code can stay in
sysfs.

Patch 2 is not only moving the code out of sysfs, but also reworking
xps_cpus_show. I think I now see where the confusion comes from: patches
2 and 3 were kept separate because they were targeting net and different
kernel versions. They could be squashed now.

Antoine


Re: [PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num

2020-12-23 Thread Antoine Tenart
Quoting Jakub Kicinski (2020-12-23 21:11:10)
> On Wed, 23 Dec 2020 20:36:33 +0100 Antoine Tenart wrote:
> > Quoting Jakub Kicinski (2020-12-23 19:27:29)
> > > On Tue, 22 Dec 2020 08:12:28 -0800 Alexander Duyck wrote:  
> > > > On Tue, Dec 22, 2020 at 1:21 AM Antoine Tenart  
> > > > wrote:
> > > >   
> > > > > If I understood correctly, as things are a bit too complex now, you
> > > > > would prefer that we go for the solution proposed in v1?
> > > > 
> > > > Yeah, that is what I am thinking. Basically we just need to make sure
> > > > the num_tc cannot be updated while we are reading the other values.  
> > > 
> > > Yeah, okay, as much as I dislike this approach 300 lines may be a little
> > > too large for stable.
> > >   
> > > > > I can still do the code factoring for the 2 sysfs show operations, but
> > > > > that would then target net-next and would be in a different series. 
> > > > > So I
> > > > > believe we'll use the patches of v1, unmodified.
> > > 
> > > Are you saying just patch 3 for net-next?  
> > 
> > The idea would be to:
> > 
> > - For net, to take all 4 patches from v1. If so, do I need to resend
> >   them?
> 
> Yes, please.

Will do.

> > - For net-next, to resend patches 2 and 3 from v2 (they'll have to be
> >   slightly reworked, to take into account the review from Alexander and
> >   the rtnl lock). The patches can be sent once the ones for net land in
> >   net-next.
> 
> If the direction is to remove xps_map_mutex, why would we need patch 2?
> 🤔

Only the patches for net are needed to fix the race conditions.

In addition to using the xps_map mutex, patches 2 and 3 from v2 factorize
the code into a single function, as xps_cpus_show and xps_rxqs_show
share the same logic. That would improve maintainability, but isn't
mandatory.

Sorry, it was not very clear...

Antoine


Re: [PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num

2020-12-23 Thread Antoine Tenart
Hi Jakub,

Quoting Jakub Kicinski (2020-12-23 19:27:29)
> On Tue, 22 Dec 2020 08:12:28 -0800 Alexander Duyck wrote:
> > On Tue, Dec 22, 2020 at 1:21 AM Antoine Tenart  wrote:
> > 
> > > If I understood correctly, as things are a bit too complex now, you
> > > would prefer that we go for the solution proposed in v1?  
> > 
> > Yeah, that is what I am thinking. Basically we just need to make sure
> > the num_tc cannot be updated while we are reading the other values.
> 
> Yeah, okay, as much as I dislike this approach 300 lines may be a little
> too large for stable.
> 
> > > I can still do the code factoring for the 2 sysfs show operations, but
> > > that would then target net-next and would be in a different series. So I
> > > believe we'll use the patches of v1, unmodified.  
> 
> Are you saying just patch 3 for net-next?

The idea would be to:

- For net, to take all 4 patches from v1. If so, do I need to resend
  them?

- For net-next, to resend patches 2 and 3 from v2 (they'll have to be
  slightly reworked, to take into account the review from Alexander and
  the rtnl lock). The patches can be sent once the ones for net land in
  net-next.

> We need to do something about the fact that with sysfs taking
> rtnl_lock xps_map_mutex is now entirely pointless. I guess its value
> eroded over the years since Tom's initial design so we can just get
> rid of it.

We should be able to remove the mutex (I'll double check as more
functions are involved). If so, I can send a patch to net-next.

Thanks!
Antoine


Re: [PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num

2020-12-22 Thread Antoine Tenart
Hello Alexander, Jakub,

Quoting Alexander Duyck (2020-12-22 00:21:57)
> 
> Looking over this patch it seems kind of obvious that extending the
> xps_map_mutex is making things far more complex then they need to be.
> 
> Applying the rtnl_mutex would probably be much simpler. Although as I
> think you have already discovered we need to apply it to the store,
> and show for this interface. In addition we probably need to perform
> similar locking around traffic_class_show in order to prevent it from
> generating a similar error.

I don't think we have the same kind of issues with traffic_class_show:
dev->num_tc is used, but not for navigating through the map. Protecting
only a single read wouldn't change much. We can still think about what
could go wrong here without the lock, but that is not related to this
series of fixes.

If I understood correctly, as things are a bit too complex now, you
would prefer that we go for the solution proposed in v1?

I can still do the code factoring for the 2 sysfs show operations, but
that would then target net-next and would be in a different series. So I
believe we'll use the patches of v1, unmodified.

Jakub, should I send a v3 then?

Thanks!
Antoine


Re: [PATCH net v2 2/3] net: move the xps cpus retrieval out of net-sysfs

2020-12-22 Thread Antoine Tenart
Hi Alexander,

Quoting Alexander Duyck (2020-12-21 23:33:15)
> 
> One thing I might change is to actually bump this patch up in the
> patch set as I think it would probably make things a bit cleaner to
> read as you are going to be moving the other functions to this pattern
> as well.

Right. If it were not for net (vs net-next), I would have split the
patches a bit differently to make things easier to review. But those
patches are fixes and can be backported to older kernel versions.
They're fixing 2 commits that were introduced in different versions, so
this patch has to be made before the next one, as it is fixing older
kernels.

(I also did not give a hint in the commit message about what is done in
patch 3, for the same reason. But I agree that's arguable.)

Thanks,
Antoine


[PATCH net v2 3/3] net: move the xps rxqs retrieval out of net-sysfs

2020-12-21 Thread Antoine Tenart
Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the xps_map mutex, to avoid possible race conditions when
dev->num_tc is updated while the map is accessed. Make use of the now
available netif_show_xps_queue helper which does just that.

This also helps to keep xps_cpus_show and xps_rxqs_show synced as their
logic is the same (as in __netif_set_xps_queue, the function allocating
and setting them up).

Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx 
queue")
Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  5 +++--
 net/core/dev.c| 15 ++-
 net/core/net-sysfs.c  | 37 ++---
 3 files changed, 19 insertions(+), 38 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bfd6cfa3ea90..5c3e16464c3f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3672,7 +3672,7 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map);
 int netif_show_xps_queue(struct net_device *dev, unsigned long **mask,
-u16 index);
+u16 index, bool is_rxqs_map);
 
 /**
  * netif_attr_test_mask - Test a CPU or Rx queue set in a mask
@@ -3773,7 +3773,8 @@ static inline int __netif_set_xps_queue(struct net_device 
*dev,
 }
 
 static inline int netif_show_xps_queue(struct net_device *dev,
-  unsigned long **mask, u16 index)
+  unsigned long **mask, u16 index,
+  bool is_rxqs_map)
 {
return 0;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index a0257da4160a..e5cc2939e4d9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2832,7 +2832,7 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 int netif_show_xps_queue(struct net_device *dev, unsigned long **mask,
-u16 index)
+u16 index, bool is_rxqs_map)
 {
const unsigned long *possible_mask = NULL;
int j, num_tc = 1, tc = 0, ret = 0;
@@ -2859,12 +2859,17 @@ int netif_show_xps_queue(struct net_device *dev, 
unsigned long **mask,
}
}
 
-   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   if (is_rxqs_map) {
+   dev_maps = rcu_dereference(dev->xps_rxqs_map);
+   nr_ids = dev->num_rx_queues;
+   } else {
+   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   nr_ids = nr_cpu_ids;
+   if (num_possible_cpus() > 1)
+   possible_mask = cpumask_bits(cpu_possible_mask);
+   }
if (!dev_maps)
goto out_no_map;
-   nr_ids = nr_cpu_ids;
-   if (num_possible_cpus() > 1)
-   possible_mask = cpumask_bits(cpu_possible_mask);
 
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 29ee69b67972..4f58b38dfc7d 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1329,7 +1329,7 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue, 
char *buf)
if (!mask)
return -ENOMEM;
 
-   ret = netif_show_xps_queue(dev, &mask, index);
+   ret = netif_show_xps_queue(dev, &mask, index, false);
if (ret) {
bitmap_free(mask);
return ret;
@@ -1379,45 +1379,20 @@ static struct netdev_queue_attribute xps_cpus_attribute 
__ro_after_init
 static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
 {
struct net_device *dev = queue->dev;
-   struct xps_dev_maps *dev_maps;
unsigned long *mask, index;
-   int j, len, num_tc = 1, tc = 0;
+   int len, ret;
 
index = get_netdev_queue_index(queue);
 
-   if (dev->num_tc) {
-   num_tc = dev->num_tc;
-   tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0)
-   return -EINVAL;
-   }
mask = bitmap_zalloc(dev->num_rx_queues, GFP_KERNEL);
if (!mask)
return -ENOMEM;
 
-   rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_rxqs_map);
-   if (!dev_maps)
-   goto out_no_maps;
-
-   for (j = -1; j = netif_attrmask_next(j, NULL, dev->num_rx_queues),
-j < dev->num_rx_queues;) {
-   int i, tci = j * num_tc + tc;
-   struct xps_map *map;
-
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (!map)
-   continue;
-
-   for (i = map->len; i--;) {
-

[PATCH net v2 2/3] net: move the xps cpus retrieval out of net-sysfs

2020-12-21 Thread Antoine Tenart
Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the xps_map mutex, to avoid possible race conditions when
dev->num_tc is updated while the map is accessed. This patch moves the
logic accessing dev->xps_cpus_map and dev->num_tc to net/core/dev.c,
where the xps_map mutex is defined and used.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart 
---
 include/linux/netdevice.h |  8 ++
 net/core/dev.c| 59 +++
 net/core/net-sysfs.c  | 54 ---
 3 files changed, 79 insertions(+), 42 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 259be67644e3..bfd6cfa3ea90 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3671,6 +3671,8 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
u16 index);
 int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
  u16 index, bool is_rxqs_map);
+int netif_show_xps_queue(struct net_device *dev, unsigned long **mask,
+u16 index);
 
 /**
  * netif_attr_test_mask - Test a CPU or Rx queue set in a mask
@@ -3769,6 +3771,12 @@ static inline int __netif_set_xps_queue(struct 
net_device *dev,
 {
return 0;
 }
+
+static inline int netif_show_xps_queue(struct net_device *dev,
+  unsigned long **mask, u16 index)
+{
+   return 0;
+}
 #endif
 
 /**
diff --git a/net/core/dev.c b/net/core/dev.c
index effdb7fee9df..a0257da4160a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2831,6 +2831,65 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
 }
 EXPORT_SYMBOL(netif_set_xps_queue);
 
+int netif_show_xps_queue(struct net_device *dev, unsigned long **mask,
+u16 index)
+{
+   const unsigned long *possible_mask = NULL;
+   int j, num_tc = 1, tc = 0, ret = 0;
+   struct xps_dev_maps *dev_maps;
+   unsigned int nr_ids;
+
+   rcu_read_lock();
+   mutex_lock(&xps_map_mutex);
+
+   if (dev->num_tc) {
+   num_tc = dev->num_tc;
+   if (num_tc < 0) {
+   ret = -EINVAL;
+   goto out_no_map;
+   }
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
+   tc = netdev_txq_to_tc(dev, index);
+   if (tc < 0) {
+   ret = -EINVAL;
+   goto out_no_map;
+   }
+   }
+
+   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   if (!dev_maps)
+   goto out_no_map;
+   nr_ids = nr_cpu_ids;
+   if (num_possible_cpus() > 1)
+   possible_mask = cpumask_bits(cpu_possible_mask);
+
+   for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
+j < nr_ids;) {
+   int i, tci = j * num_tc + tc;
+   struct xps_map *map;
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (!map)
+   continue;
+
+   for (i = map->len; i--;) {
+   if (map->queues[i] == index) {
+   set_bit(j, *mask);
+   break;
+   }
+   }
+   }
+
+out_no_map:
+   mutex_unlock(&xps_map_mutex);
+   rcu_read_unlock();
+
+   return ret;
+}
+EXPORT_SYMBOL(netif_show_xps_queue);
 #endif
 
 static void __netdev_unbind_sb_channel(struct net_device *dev,
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 999b70c59761..29ee69b67972 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1314,60 +1314,30 @@ static const struct attribute_group dql_group = {
 #endif /* CONFIG_BQL */
 
 #ifdef CONFIG_XPS
-static ssize_t xps_cpus_show(struct netdev_queue *queue,
-char *buf)
+static ssize_t xps_cpus_show(struct netdev_queue *queue, char *buf)
 {
struct net_device *dev = queue->dev;
-   int cpu, len, num_tc = 1, tc = 0;
-   struct xps_dev_maps *dev_maps;
-   cpumask_var_t mask;
-   unsigned long index;
+   unsigned long *mask, index;
+   int len, ret;
 
if (!netif_is_multiqueue(dev))
return -ENOENT;
 
index = get_netdev_queue_index(queue);
 
-   if (dev->num_tc) {
-   /* Do not allow XPS on subordinate device directly */
-   num_tc = dev->num_tc;
-   if (num_tc < 0)
-   return -EINVAL;
-
-   /* If queue belongs to subordinate dev use its map */
-   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev

[PATCH net v2 0/3] net-sysfs: fix race conditions in the xps code

2020-12-21 Thread Antoine Tenart
Hello all,

This series fixes race conditions in the xps code, where out-of-bounds
accesses can occur when dev->num_tc is updated, triggering oops. The
root cause is linked to locking issues. An explanation is given in each of
the commit logs.

Reviews of v1 suggested using the xps_map_mutex to protect the maps and
their related parameters instead of the rtnl lock. We followed this path
in v2 as it seemed a better compromise than taking the rtnl lock.

As a result, patch 1 turned out to be less straightforward, as some of
the locking logic in net/core/dev.c related to xps_map_mutex had to be
changed. Patches 2 and 3 are also larger in v2 as code had to be moved
from net/core/net-sysfs.c to net/core/dev.c to take the xps_map_mutex
(however, maintainability is improved).

Also, while working on v2 I stumbled upon another race condition. I
debugged it and the fix is the same as in patch 1; I updated that patch's
commit log to describe both races.

Thanks!
Antoine

Antoine Tenart (3):
  net: fix race conditions in xps by locking the maps and dev->tc_num
  net: move the xps cpus retrieval out of net-sysfs
  net: move the xps rxqs retrieval out of net-sysfs

 include/linux/netdevice.h |   9 ++
 net/core/dev.c| 186 +-
 net/core/net-sysfs.c  |  89 --
 3 files changed, 171 insertions(+), 113 deletions(-)

-- 
2.29.2



[PATCH net v2 1/3] net: fix race conditions in xps by locking the maps and dev->tc_num

2020-12-21 Thread Antoine Tenart
Two race conditions can be triggered in xps, resulting in various oops
and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

   - netdev_set_num_tc sets dev->num_tc.

   - netif_set_xps_queue uses dev->num_tc as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->num_tc is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   If new_dev_maps is allocated using dev->num_tc and dev->num_tc is then
   set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could touch memory outside of
   new_dev_maps, triggering an oops.

   One way of triggering this is to set an iface up (for which the
   driver uses netdev_set_num_tc in the open path, such as bnx2x) and to
   write to xps_cpus or xps_rxqs in a concurrent thread. With the right
   timing an oops is triggered.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues;
        dev->num_tc isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the maps with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->num_tc.

   2.4. Later accesses to the map lead to out-of-bounds accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

   The fix can't be to only link the size of the maps to them, as an
   invalid configuration could still occur. The reset-then-set logic in
   both netdev_set_num_tc and netdev_reset_tc must be protected by a
   lock.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive.

This patch fixes those races by:

- Reworking netif_set_xps_queue by moving the xps_map_mutex up so the
  access of dev->num_tc is done under the lock.

- Using xps_map_mutex in both netdev_set_num_tc and netdev_reset_tc for
  the reset and set logic:

  + As xps_map_mutex was taken in the reset path, netif_reset_xps_queues
had to be reworked to offer an unlocked version (as well as
netdev_unbind_all_sb_channels which calls it).

  + cpus_read_lock was taken in the reset path as well, and is always
taken before xps_map_mutex. It had to be moved out of the unlocked
version as well.

  This is why the patch is a little bit longer, and moves
  netdev_unbind_sb_channel up in the file.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart 
---
 net/core/dev.c | 122 -
 1 file changed, 81 insertions(+), 41 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8fa739259041..effdb7fee9df 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2527,8 +2527,8 @@ static void clean_xps_maps(struct net_device *dev, const 
unsigned long *mask,
}
 }
 
-static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
-  u16 count)
+static void __netif_reset_xps_queues(struct net_device *dev, u16 offset,
+u16 count)
 {
const unsigned long *possible_mask = NULL;
struct xps_dev_maps *dev_maps;
@@ -2537,9 +2537,6 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
if (!static_key_false(&xps_needed))
return;
 
-   cpus_read_lock();
-   mutex_lock(&xps_map_mutex);
-
if (static_key_false(&xps_rxqs_needed)) {
dev_maps = xmap_dereference(dev->xps_rxqs_map);
if (dev_maps) {
@@ -2551,15 +2548,23 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
if (!dev_maps)
-   goto out_no_maps;
+   return;
 
if (num_possible_cpus() > 1)
possible_mask = cpumask_bits(cpu_possible_mask);
nr_ids = nr_cpu_ids;
clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
   false);
+}
+
+static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
+  u16 count)
+{
+   cpus_read_lock();
+   mutex_lock(&xps_map_mutex);
+
+   __netif_reset_xps_queues(dev, offset, count);
 
-out_no_maps:
mutex_unlock(&xps_map_mutex);
cpus_read_unlock();
 }
@@ -2615,27 +2620,32 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
 {
const unsigned long *online_mask = NULL, *possible_mask = NULL;
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
-   int i, j, tci, numa_node_id = -2;
+   int i, j, tci, numa_node_id = -2, ret = 0;
int maps_sz, num_tc = 1, tc = 0;
struct xps_map *map, *new_map;
bool active = false;
unsigned

Re: [PATCH net 1/4] net-sysfs: take the rtnl lock when storing xps_cpus

2020-12-19 Thread Antoine Tenart
Hi Jakub, Alexander,

Quoting Alexander Duyck (2020-12-19 02:41:08)
> On Fri, Dec 18, 2020 at 4:30 PM Jakub Kicinski  wrote:
> >
> > Two things: (a) is the datapath not exposed to a similar problem?
> > __get_xps_queue_idx() uses dev->tc_num in a very similar fashion.
> 
> I think we are shielded from this by the fact that if you change the
> number of tc the Tx path has to be torn down and rebuilt since you are
> normally changing the qdisc configuration anyway.

That's right. But there's nothing preventing users from calling
functions that use the xps maps in between; a few of those functions
are exposed.

One (similar) example of that is another bug I reproduced, where the
old and the new map in __netif_set_xps_queue do not have the same size
because num_tc was updated between two calls to this function. The
root cause is the same: the size of the map is not embedded in it, so
whenever we access it we can make an out-of-bounds access.

> > Should we perhaps make the "num_tcs" part of the XPS maps which is
> > under RCU protection rather than accessing the netdev copy?

Yes, I have a local patch (untested, still WIP) doing exactly that. The
idea is that we can't guarantee a num_tc update will trigger an xps
reallocation / reconfiguration of the map, but at least we can make
sure the map won't be accessed out of bounds.
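
To give an idea of the direction (purely a sketch of that WIP, not a
tested patch), the number of traffic classes would be embedded in the
RCU-protected map at allocation time, something like:

struct xps_dev_maps {
        struct rcu_head rcu;
        s16 num_tc;                       /* dev->num_tc at alloc time */
        struct xps_map __rcu *attr_map[]; /* num_tc entries per CPU/rxq */
};

Readers would then use dev_maps->num_tc under rcu_read_lock instead of
dereferencing dev->num_tc, so the indexing always matches the size the
map was allocated with.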

It's a different issue though: that change prevents a map from being
accessed out of bounds once it has been allocated, whereas this patch
prevents num_tc from being updated while the xps map allocation/setup
is in progress.

> So it looks like the issue is the fact that we really need to
> synchronize netdev_reset_tc, netdev_set_tc_queue, and
> netdev_set_num_tc with __netif_set_xps_queue.
> 
> > (b) if we always take rtnl_lock, why have xps_map_mutex? Can we
> > rearrange things so that xps_map_mutex is sufficient?
> 
> It seems like the quick and dirty way would be to look at updating the
> 3 functions I called out so that they were holding the xps_map_mutex
> while they were updating things, and for __netif_set_xps_queue to
> expand out the mutex to include the code starting at "if (dev->num_tc)
> {".

That should do the trick. The only downside is that xps_map_mutex is
only defined with CONFIG_XPS while netdev_set_num_tc is not, adding
more ifdefs to it. But that's probably a better compromise than taking
the rtnl lock.
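
One way to keep the ifdef noise out of netdev_set_num_tc itself (a
sketch of the compromise being discussed; netif_lock_xps_maps and
netif_unlock_xps_maps are made-up names) would be a pair of stub
helpers:

#ifdef CONFIG_XPS
static void netif_lock_xps_maps(void)
{
        mutex_lock(&xps_map_mutex);
}

static void netif_unlock_xps_maps(void)
{
        mutex_unlock(&xps_map_mutex);
}
#else
static void netif_lock_xps_maps(void) { }
static void netif_unlock_xps_maps(void) { }
#endif

netdev_set_num_tc and netdev_reset_tc could then call these
unconditionally around their reset-then-set logic.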

Thanks for the review and suggestions!
Antoine


Re: [PATCH net 1/4] net-sysfs: take the rtnl lock when storing xps_cpus

2020-12-18 Thread Antoine Tenart
That build issue seems unrelated to the patch. The series as a whole
builds fine according to the same report, and this code is not modified
by later patches.

It looks a lot like this report from yesterday:
https://www.spinics.net/lists/netdev/msg709132.html

Which also seemed unrelated to the changes:
https://www.spinics.net/lists/netdev/msg709264.html

Thanks!
Antoine

Quoting kernel test robot (2020-12-18 16:27:46)
> Hi Antoine,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on net/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Antoine-Tenart/net-sysfs-fix-race-conditions-in-the-xps-code/20201218-002852
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 
> 3ae32c07815a24ae12de2e7838d9d429ba31e5e0
> config: riscv-randconfig-r014-20201217 (attached as .config)
> compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 
> cee1e7d14f4628d6174b33640d502bff3b54ae45)
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install riscv cross compiling tool for clang build
> # apt-get install binutils-riscv64-linux-gnu
> # 
> https://github.com/0day-ci/linux/commit/f989c3dcbe4d9abd1c6c48b34f08c6c0cd9d44b3
> git remote add linux-review https://github.com/0day-ci/linux
> git fetch --no-tags linux-review 
> Antoine-Tenart/net-sysfs-fix-race-conditions-in-the-xps-code/20201218-002852
> git checkout f989c3dcbe4d9abd1c6c48b34f08c6c0cd9d44b3
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> Note: the 
> linux-review/Antoine-Tenart/net-sysfs-fix-race-conditions-in-the-xps-code/20201218-002852
>  HEAD 563d144b47845dea594b409ecf22914b9797cd1e builds fine.
>   It only hurts bisectibility.
> 
> All errors (new ones prefixed by >>):
> 
>/tmp/ics932s401-422897.s: Assembler messages:
> >> /tmp/ics932s401-422897.s:260: Error: unrecognized opcode `zext.b a1,s11'
>/tmp/ics932s401-422897.s:362: Error: unrecognized opcode `zext.b a1,s11'
>/tmp/ics932s401-422897.s:518: Error: unrecognized opcode `zext.b a1,s11'
>/tmp/ics932s401-422897.s:637: Error: unrecognized opcode `zext.b a1,s11'
>/tmp/ics932s401-422897.s:774: Error: unrecognized opcode `zext.b a1,s11'
>/tmp/ics932s401-422897.s:893: Error: unrecognized opcode `zext.b a1,s11'
>/tmp/ics932s401-422897.s:1021: Error: unrecognized opcode `zext.b a1,s11'
> >> /tmp/ics932s401-422897.s:1180: Error: unrecognized opcode `zext.b a1,s2'
>clang-12: error: assembler command failed with exit code 1 (use -v to see 
> invocation)
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH net] net: mvpp2: Add TCAM entry to drop flow control pause frames

2020-12-17 Thread Antoine Tenart
Quoting stef...@marvell.com (2020-12-17 18:45:06)
> From: Stefan Chulski 
> 
> Issue:
> Flow control frame used to pause GoP(MAC) was delivered to the CPU
> and created a load on the CPU. Since XOFF/XON frames are used only
> by MAC, these frames should be dropped inside MAC.
> 
> Fix:
> According to 802.3-2012 - IEEE Standard for Ethernet pause frame
> has unique destination MAC address 01-80-C2-00-00-01.
> Add TCAM parser entry to track and drop pause frames by destination MAC.
> 
> Fixes: db9d7d36eecc ("net: mvpp2: Split the PPv2 driver to a dedicated 
> directory")

Same here, you should go further in the git history.

Also, was that behaviour introduced when the TCAM support landed
(overriding its default configuration?), or has it been there since the
beginning? I'm asking because while this could very well be a fix, it
could also fall into the improvements category.

Thanks!
Antoine

> Signed-off-by: Stefan Chulski 
> ---
>  drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c | 34 
> ++
>  drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.h |  2 +-
>  2 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c 
> b/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c
> index 1a272c2..3a9c747 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c
> @@ -405,6 +405,39 @@ static int mvpp2_prs_tcam_first_free(struct mvpp2 *priv, 
> unsigned char start,
> return -EINVAL;
>  }
>  
> +/* Drop flow control pause frames */
> +static void mvpp2_prs_drop_fc(struct mvpp2 *priv)
> +{
> +   struct mvpp2_prs_entry pe;
> +   unsigned int len;
> +   unsigned char da[ETH_ALEN] = {
> +   0x01, 0x80, 0xC2, 0x00, 0x00, 0x01 };
> +
> +   memset(&pe, 0, sizeof(pe));
> +
> +   /* For all ports - drop flow control frames */
> +   pe.index = MVPP2_PE_FC_DROP;
> +   mvpp2_prs_tcam_lu_set(&pe, MVPP2_PRS_LU_MAC);
> +
> +   /* Set match on DA */
> +   len = ETH_ALEN;
> +   while (len--)
> +   mvpp2_prs_tcam_data_byte_set(&pe, len, da[len], 0xff);
> +
> +   mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_DROP_MASK,
> +MVPP2_PRS_RI_DROP_MASK);
> +
> +   mvpp2_prs_sram_bits_set(&pe, MVPP2_PRS_SRAM_LU_GEN_BIT, 1);
> +   mvpp2_prs_sram_next_lu_set(&pe, MVPP2_PRS_LU_FLOWS);
> +
> +   /* Mask all ports */
> +   mvpp2_prs_tcam_port_map_set(&pe, MVPP2_PRS_PORT_MASK);
> +
> +   /* Update shadow table and hw entry */
> +   mvpp2_prs_shadow_set(priv, pe.index, MVPP2_PRS_LU_MAC);
> +   mvpp2_prs_hw_write(priv, &pe);
> +}
> +
>  /* Enable/disable dropping all mac da's */
>  static void mvpp2_prs_mac_drop_all_set(struct mvpp2 *priv, int port, bool 
> add)
>  {
> @@ -1168,6 +1201,7 @@ static void mvpp2_prs_mac_init(struct mvpp2 *priv)
> mvpp2_prs_hw_write(priv, &pe);
>  
> /* Create dummy entries for drop all and promiscuous modes */
> +   mvpp2_prs_drop_fc(priv);
> mvpp2_prs_mac_drop_all_set(priv, 0, false);
> mvpp2_prs_mac_promisc_set(priv, 0, MVPP2_PRS_L2_UNI_CAST, false);
> mvpp2_prs_mac_promisc_set(priv, 0, MVPP2_PRS_L2_MULTI_CAST, false);
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.h 
> b/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.h
> index e22f6c8..4b68dd3 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.h
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.h
> @@ -129,7 +129,7 @@
>  #define MVPP2_PE_VID_EDSA_FLTR_DEFAULT (MVPP2_PRS_TCAM_SRAM_SIZE - 7)
>  #define MVPP2_PE_VLAN_DBL  (MVPP2_PRS_TCAM_SRAM_SIZE - 6)
>  #define MVPP2_PE_VLAN_NONE (MVPP2_PRS_TCAM_SRAM_SIZE - 5)
> -/* reserved */
> +#define MVPP2_PE_FC_DROP   (MVPP2_PRS_TCAM_SRAM_SIZE - 4)
>  #define MVPP2_PE_MAC_MC_PROMISCUOUS(MVPP2_PRS_TCAM_SRAM_SIZE - 3)
>  #define MVPP2_PE_MAC_UC_PROMISCUOUS(MVPP2_PRS_TCAM_SRAM_SIZE - 2)
>  #define MVPP2_PE_MAC_NON_PROMISCUOUS   (MVPP2_PRS_TCAM_SRAM_SIZE - 1)
> -- 
> 1.9.1
> 



Re: [PATCH net] net: mvpp2: prs: fix PPPoE with ipv6 packet parse

2020-12-17 Thread Antoine Tenart
Hi Stefan,

Quoting stef...@marvell.com (2020-12-17 18:23:11)
> From: Stefan Chulski 
> 
> Current PPPoE+IPv6 entry is jumping to 'next-hdr'
> field and not to 'DIP' field as done for IPv4.
> 
> Fixes: db9d7d36eecc ("net: mvpp2: Split the PPv2 driver to a dedicated 
> directory")

That's not the commit introducing the issue. You can use
`git log --follow` to go further back (or point directly to the old
mvpp2.c file).

Thanks!
Antoine

> Reported-by: Liron Himi 
> Signed-off-by: Stefan Chulski 
> ---
>  drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c 
> b/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c
> index b9e5b08..1a272c2 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_prs.c
> @@ -1655,8 +1655,9 @@ static int mvpp2_prs_pppoe_init(struct mvpp2 *priv)
> mvpp2_prs_sram_next_lu_set(&pe, MVPP2_PRS_LU_IP6);
> mvpp2_prs_sram_ri_update(&pe, MVPP2_PRS_RI_L3_IP6,
>  MVPP2_PRS_RI_L3_PROTO_MASK);
> -   /* Skip eth_type + 4 bytes of IPv6 header */
> -   mvpp2_prs_sram_shift_set(&pe, MVPP2_ETH_TYPE_LEN + 4,
> +   /* Jump to DIP of IPV6 header */
> +   mvpp2_prs_sram_shift_set(&pe, MVPP2_ETH_TYPE_LEN + 8 +
> +MVPP2_MAX_L3_ADDR_SIZE,
>  MVPP2_PRS_SRAM_OP_SEL_SHIFT_ADD);
> /* Set L3 offset */
> mvpp2_prs_sram_offset_set(&pe, MVPP2_PRS_SRAM_UDF_TYPE_L3,
> -- 
> 1.9.1
> 



[PATCH net 1/4] net-sysfs: take the rtnl lock when storing xps_cpus

2020-12-17 Thread Antoine Tenart
Callers to netif_set_xps_queue should take the rtnl lock. Failing to do
so can lead to race conditions between netdev_set_num_tc and
netif_set_xps_queue, triggering various oops:

- netif_set_xps_queue uses dev->num_tc as one of the parameters to
  compute the size of new_dev_maps when allocating it. dev->num_tc is
  also used to access the map, and the compiler may generate code that
  retrieves this field multiple times in the function.

- netdev_set_num_tc sets dev->num_tc.

If new_dev_maps is allocated using dev->num_tc and dev->num_tc is then
set to a higher value through netdev_set_num_tc, later accesses to
new_dev_maps in netif_set_xps_queue can read or write memory outside
of new_dev_maps, triggering an oops.

One way of triggering this is to bring an interface up (one whose
driver calls netdev_set_num_tc in its open path, such as bnx2x) while
writing to xps_cpus from a concurrent thread. With the right timing,
an oops is triggered.
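
The store handler then follows the usual net-sysfs idiom of
rtnl_trylock() + restart_syscall() (condensed from the diff below): a
writer that races with an rtnl holder retries the syscall instead of
sleeping on the lock, which avoids deadlocking against paths that hold
the rtnl lock while tearing down the sysfs entries.

        /* Condensed view of xps_cpus_store() after the fix; the
         * parsing of 'buf' into 'mask' is elided.
         */
        if (!rtnl_trylock()) {
                free_cpumask_var(mask);
                return restart_syscall();
        }

        err = netif_set_xps_queue(dev, mask, index);
        rtnl_unlock();

        free_cpumask_var(mask);
        return err ? : len;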

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 999b70c59761..7cc15dec1717 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1396,7 +1396,13 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
return err;
}
 
+   if (!rtnl_trylock()) {
+   free_cpumask_var(mask);
+   return restart_syscall();
+   }
+
err = netif_set_xps_queue(dev, mask, index);
+   rtnl_unlock();
 
free_cpumask_var(mask);
 
-- 
2.29.2



[PATCH net 2/4] net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc

2020-12-17 Thread Antoine Tenart
Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 7cc15dec1717..65886bfbf822 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1317,8 +1317,8 @@ static const struct attribute_group dql_group = {
 static ssize_t xps_cpus_show(struct netdev_queue *queue,
 char *buf)
 {
+   int cpu, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
-   int cpu, len, num_tc = 1, tc = 0;
struct xps_dev_maps *dev_maps;
cpumask_var_t mask;
unsigned long index;
@@ -1328,22 +1328,31 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 
index = get_netdev_queue_index(queue);
 
+   if (!rtnl_trylock())
+   return restart_syscall();
+
if (dev->num_tc) {
/* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
-   if (num_tc < 0)
-   return -EINVAL;
+   if (num_tc < 0) {
+   ret = -EINVAL;
+   goto err_rtnl_unlock;
+   }
 
/* If queue belongs to subordinate dev use its map */
dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
 
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0)
-   return -EINVAL;
+   if (tc < 0) {
+   ret = -EINVAL;
+   goto err_rtnl_unlock;
+   }
}
 
-   if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
-   return -ENOMEM;
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   ret = -ENOMEM;
+   goto err_rtnl_unlock;
+   }
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_cpus_map);
@@ -1366,9 +1375,15 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
}
rcu_read_unlock();
 
+   rtnl_unlock();
+
len = snprintf(buf, PAGE_SIZE, "%*pb\n", cpumask_pr_args(mask));
free_cpumask_var(mask);
return len < PAGE_SIZE ? len : -EINVAL;
+
+err_rtnl_unlock:
+   rtnl_unlock();
+   return ret;
 }
 
 static ssize_t xps_cpus_store(struct netdev_queue *queue,
-- 
2.29.2



[PATCH net 3/4] net-sysfs: take the rtnl lock when storing xps_rxqs

2020-12-17 Thread Antoine Tenart
Callers to __netif_set_xps_queue should take the rtnl lock. Failing to
do so can lead to race conditions between netdev_set_num_tc and
__netif_set_xps_queue, triggering various oops:

- __netif_set_xps_queue uses dev->num_tc as one of the parameters to
  compute the size of new_dev_maps when allocating it. dev->num_tc is
  also used to access the map, and the compiler may generate code that
  retrieves this field multiple times in the function.

- netdev_set_num_tc sets dev->num_tc.

If new_dev_maps is allocated using dev->num_tc and dev->num_tc is then
set to a higher value through netdev_set_num_tc, later accesses to
new_dev_maps in __netif_set_xps_queue can read or write memory outside
of new_dev_maps, triggering an oops.

One way of triggering this is to bring an interface up (one whose
driver calls netdev_set_num_tc in its open path, such as bnx2x) while
writing to xps_rxqs from a concurrent thread. With the right timing,
an oops is triggered.

Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx 
queue")
Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 65886bfbf822..62ca2f2c0ee6 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1499,10 +1499,17 @@ static ssize_t xps_rxqs_store(struct netdev_queue 
*queue, const char *buf,
return err;
}
 
+   if (!rtnl_trylock()) {
+   bitmap_free(mask);
+   return restart_syscall();
+   }
+
cpus_read_lock();
err = __netif_set_xps_queue(dev, mask, index, true);
cpus_read_unlock();
 
+   rtnl_unlock();
+
bitmap_free(mask);
return err ? : len;
 }
-- 
2.29.2



[PATCH net 0/4] net-sysfs: fix race conditions in the xps code

2020-12-17 Thread Antoine Tenart
Hello all,

This series fixes two similar issues, one with xps_cpus and one with
xps_rxqs, where a race condition can occur between accesses to the
xps_cpus and xps_rxqs maps and the update of dev->num_tc. Those races
lead to accesses outside of the maps and to oopses being triggered. An
explanation is given in each of the commit logs.

Thanks!
Antoine

Antoine Tenart (4):
  net-sysfs: take the rtnl lock when storing xps_cpus
  net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
  net-sysfs: take the rtnl lock when storing xps_rxqs
  net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc

 net/core/net-sysfs.c | 65 
 1 file changed, 53 insertions(+), 12 deletions(-)

-- 
2.29.2



[PATCH net 4/4] net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc

2020-12-17 Thread Antoine Tenart
Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx 
queue")
Signed-off-by: Antoine Tenart 
---
 net/core/net-sysfs.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 62ca2f2c0ee6..daf502c13d6d 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1429,22 +1429,29 @@ static struct netdev_queue_attribute xps_cpus_attribute 
__ro_after_init
 
 static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
 {
+   int j, len, ret, num_tc = 1, tc = 0;
struct net_device *dev = queue->dev;
struct xps_dev_maps *dev_maps;
unsigned long *mask, index;
-   int j, len, num_tc = 1, tc = 0;
 
index = get_netdev_queue_index(queue);
 
+   if (!rtnl_trylock())
+   return restart_syscall();
+
if (dev->num_tc) {
num_tc = dev->num_tc;
tc = netdev_txq_to_tc(dev, index);
-   if (tc < 0)
-   return -EINVAL;
+   if (tc < 0) {
+   ret = -EINVAL;
+   goto err_rtnl_unlock;
+   }
}
mask = bitmap_zalloc(dev->num_rx_queues, GFP_KERNEL);
-   if (!mask)
-   return -ENOMEM;
+   if (!mask) {
+   ret = -ENOMEM;
+   goto err_rtnl_unlock;
+   }
 
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_rxqs_map);
@@ -1470,10 +1477,16 @@ static ssize_t xps_rxqs_show(struct netdev_queue 
*queue, char *buf)
 out_no_maps:
rcu_read_unlock();
 
+   rtnl_unlock();
+
len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
bitmap_free(mask);
 
return len < PAGE_SIZE ? len : -EINVAL;
+
+err_rtnl_unlock:
+   rtnl_unlock();
+   return ret;
 }
 
 static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
-- 
2.29.2


