Re: [PATCH net] openvswitch: fix send of uninitialized stack memory in ct limit reply

2021-04-04 Thread Ilya Maximets
CC: ovs-dev

On 4/4/21 7:50 PM, Ilya Maximets wrote:
> 'struct ovs_zone_limit' has more members than initialized in
> ovs_ct_limit_get_default_limit().  The rest of the memory is a random
> kernel stack content that ends up being sent to userspace.
> 
> Fix that by using a designated initializer that will clear all
> non-specified fields.
> 
> Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
> Signed-off-by: Ilya Maximets 
> ---
>  net/openvswitch/conntrack.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index c29b0ef1fc27..cadb6a29b285 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -2032,10 +2032,10 @@ static int ovs_ct_limit_del_zone_limit(struct nlattr *nla_zone_limit,
>  static int ovs_ct_limit_get_default_limit(struct ovs_ct_limit_info *info,
> struct sk_buff *reply)
>  {
> - struct ovs_zone_limit zone_limit;
> -
> - zone_limit.zone_id = OVS_ZONE_LIMIT_DEFAULT_ZONE;
> - zone_limit.limit = info->default_limit;
> + struct ovs_zone_limit zone_limit = {
> + .zone_id = OVS_ZONE_LIMIT_DEFAULT_ZONE,
> + .limit   = info->default_limit,
> + };
>  
>   return nla_put_nohdr(reply, sizeof(zone_limit), &zone_limit);
>  }
> 



[PATCH net] openvswitch: fix send of uninitialized stack memory in ct limit reply

2021-04-04 Thread Ilya Maximets
'struct ovs_zone_limit' has more members than initialized in
ovs_ct_limit_get_default_limit().  The rest of the memory is a random
kernel stack content that ends up being sent to userspace.

Fix that by using a designated initializer that will clear all
non-specified fields.

Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
Signed-off-by: Ilya Maximets 
---
 net/openvswitch/conntrack.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c29b0ef1fc27..cadb6a29b285 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -2032,10 +2032,10 @@ static int ovs_ct_limit_del_zone_limit(struct nlattr *nla_zone_limit,
 static int ovs_ct_limit_get_default_limit(struct ovs_ct_limit_info *info,
  struct sk_buff *reply)
 {
-   struct ovs_zone_limit zone_limit;
-
-   zone_limit.zone_id = OVS_ZONE_LIMIT_DEFAULT_ZONE;
-   zone_limit.limit = info->default_limit;
+   struct ovs_zone_limit zone_limit = {
+   .zone_id = OVS_ZONE_LIMIT_DEFAULT_ZONE,
+   .limit   = info->default_limit,
+   };
 
	return nla_put_nohdr(reply, sizeof(zone_limit), &zone_limit);
 }
-- 
2.26.2
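
(For illustration, a minimal user-space sketch of the initializer semantics
the fix relies on; the struct below is a hypothetical stand-in, not the real
'struct ovs_zone_limit'. Members not named in a designated initializer are
zero-initialized, so the previously untouched members no longer carry random
stack contents. This guarantee covers the members; whether padding bytes are
zeroed is a separate question, discussed in the libbpf thread below.)

#include <stdio.h>

struct zone_limit {                     /* hypothetical stand-in */
	int zone_id;
	unsigned int limit;
	unsigned long long count;       /* never assigned below */
};

int main(void)
{
	struct zone_limit zl = {
		.zone_id = -1,          /* e.g. a "default zone" marker */
		.limit   = 42,
	};

	/* 'count' is not named in the initializer, so it is zeroed. */
	printf("count = %llu\n", zl.count);     /* prints 0 */
	return 0;
}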



Re: [PATCH bpf v2] libbpf: fix passing uninitialized bytes to setsockopt

2019-10-13 Thread Ilya Maximets

On 13.10.2019 1:24, Alexei Starovoitov wrote:

On Wed, Oct 09, 2019 at 06:49:29PM +0200, Ilya Maximets wrote:

'struct xdp_umem_reg' has 4 bytes of padding at the end that makes
valgrind complain about passing uninitialized stack memory to the
syscall:

   Syscall param socketcall.setsockopt() points to uninitialised byte(s)
 at 0x4E7AB7E: setsockopt (in /usr/lib64/libc-2.29.so)
 by 0x4BDE035: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:172)
   Uninitialised value was created by a stack allocation
 at 0x4BDDEBA: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:140)

Padding bytes appeared after the introduction of the new 'flags' field.
memset() is required to clear them.

Fixes: 10d30e301732 ("libbpf: add flags to umem config")
Signed-off-by: Ilya Maximets 
---

Version 2:
   * Struct initializer replaced with explicit memset(). [Andrii]

  tools/lib/bpf/xsk.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index a902838f9fcc..9d5348086203 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -163,6 +163,7 @@ int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area,
umem->umem_area = umem_area;
	xsk_set_umem_config(&umem->config, usr_config);
  
+	memset(&mr, 0, sizeof(mr));
	mr.addr = (uintptr_t)umem_area;
mr.len = size;
mr.chunk_size = umem->config.frame_size;


This was already applied. Why did you resend?



Sorry, it wasn't me.  Looking at the mail delivery chain:

Received: from listssympa-test.colorado.edu (listssympa-test.colorado.edu 
[128.138.129.156])
by spool.mail.gandi.net (Postfix) with ESMTPS id 66E2F780445
for ; Sun, 13 Oct 2019 04:52:14 + (UTC)
Received: from listssympa-test.colorado.edu (localhost [127.0.0.1])
by listssympa-test.colorado.edu (8.15.2/8.15.2/MJC-8.0/sympa) with 
ESMTPS id x9D4pvsL015926
(version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
Sat, 12 Oct 2019 22:51:57 -0600
Received: (from root@localhost)
by listssympa-test.colorado.edu (8.15.2/8.15.2/MJC-8.0/submit) id 
x9D4pujl015885;
Sat, 12 Oct 2019 22:51:56 -0600
Received: from DM5PR03MB3273.namprd03.prod.outlook.com (2603:10b6:a03:54::17) by
 BYAPR03MB4376.namprd03.prod.outlook.com with HTTPS via
 BYAPR02CA0040.NAMPRD02.PROD.OUTLOOK.COM; Wed, 9 Oct 2019 22:04:15 +
Received: from BN6PR03CA0057.namprd03.prod.outlook.com (2603:10b6:404:4c::19) by
 DM5PR03MB3273.namprd03.prod.outlook.com (2603:10b6:4:42::32) with Microsoft
 SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
 id 15.20.2347.16; Wed, 9 Oct 2019 17:44:13 +

There is some strange server, listssympa-test.colorado.edu.
Looks like someone at colorado.edu is testing stuff on a production server.

The simplified delivery chain looks like this:

Me -> relay6-d.mail.gandi.net -> vger.kernel.org -> mx.colorado.edu ->
mail.protection.outlook.com -> namprd03.prod.outlook.com ->
listssympa-test.colorado.edu -> spool.mail.gandi.net -> Me again!

Best regards, Ilya Maximets.


Re: [PATCH bpf] libbpf: fix passing uninitialized bytes to setsockopt

2019-10-13 Thread Ilya Maximets

On 13.10.2019 6:59, Alexei Starovoitov wrote:

On Sat, Oct 12, 2019 at 9:52 PM Ilya Maximets  wrote:


'struct xdp_umem_reg' has 4 bytes of padding at the end that makes
valgrind complain about passing uninitialized stack memory to the
syscall:

   Syscall param socketcall.setsockopt() points to uninitialised byte(s)
 at 0x4E7AB7E: setsockopt (in /usr/lib64/libc-2.29.so)
 by 0x4BDE035: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:172)
   Uninitialised value was created by a stack allocation
 at 0x4BDDEBA: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:140)

Padding bytes appeared after the introduction of the new 'flags' field.

Fixes: 10d30e301732 ("libbpf: add flags to umem config")
Signed-off-by: Ilya Maximets 


Something is not right with (e|g)mail.
This is the 3rd email I got with the same patch.
The first one (the one that was applied) was 3 days ago.



I'm sorry.  I don't know why the mail server started re-sending
these e-mails.  I'm receiving them too.

That is strange.

Best regards, Ilya Maximets.


[PATCH bpf v2] libbpf: fix passing uninitialized bytes to setsockopt

2019-10-09 Thread Ilya Maximets
'struct xdp_umem_reg' has 4 bytes of padding at the end that makes
valgrind complain about passing uninitialized stack memory to the
syscall:

  Syscall param socketcall.setsockopt() points to uninitialised byte(s)
at 0x4E7AB7E: setsockopt (in /usr/lib64/libc-2.29.so)
by 0x4BDE035: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:172)
  Uninitialised value was created by a stack allocation
at 0x4BDDEBA: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:140)

Padding bytes appeared after the introduction of the new 'flags' field.
memset() is required to clear them.

Fixes: 10d30e301732 ("libbpf: add flags to umem config")
Signed-off-by: Ilya Maximets 
---

Version 2:
  * Struct initializer replaced with explicit memset(). [Andrii]

 tools/lib/bpf/xsk.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index a902838f9fcc..9d5348086203 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -163,6 +163,7 @@ int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area,
umem->umem_area = umem_area;
	xsk_set_umem_config(&umem->config, usr_config);
 
+   memset(&mr, 0, sizeof(mr));
mr.addr = (uintptr_t)umem_area;
mr.len = size;
mr.chunk_size = umem->config.frame_size;
-- 
2.17.1



Re: [PATCH bpf] libbpf: fix passing uninitialized bytes to setsockopt

2019-10-09 Thread Ilya Maximets

On 09.10.2019 18:29, Andrii Nakryiko wrote:

On Wed, Oct 9, 2019 at 8:43 AM Ilya Maximets  wrote:


'struct xdp_umem_reg' has 4 bytes of padding at the end that makes
valgrind complain about passing uninitialized stack memory to the
syscall:

   Syscall param socketcall.setsockopt() points to uninitialised byte(s)
 at 0x4E7AB7E: setsockopt (in /usr/lib64/libc-2.29.so)
 by 0x4BDE035: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:172)
   Uninitialised value was created by a stack allocation
 at 0x4BDDEBA: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:140)

Padding bytes appeared after the introduction of the new 'flags' field.

Fixes: 10d30e301732 ("libbpf: add flags to umem config")
Signed-off-by: Ilya Maximets 
---
  tools/lib/bpf/xsk.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index a902838f9fcc..26d9db783560 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -139,7 +139,7 @@ int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area,
 const struct xsk_umem_config *usr_config)
  {
 struct xdp_mmap_offsets off;
-   struct xdp_umem_reg mr;
+   struct xdp_umem_reg mr = {};


well, guess what, even with this explicit initialization, padding is
not guaranteed to be initialized (and it sometimes is not in
practice, I ran into such problems); only since the C11 standard is it
specified that padding is also zero-initialized. You have to do a
memset to 0.


OK. Sure. I'll send v2.




 struct xsk_umem *umem;
 socklen_t optlen;
 void *map;
--
2.17.1
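
(For reference, a minimal sketch of the distinction Andrii points out above,
using a hypothetical struct rather than the real 'struct xdp_umem_reg': an
initializer is guaranteed to zero the members, but zeroing of the trailing
padding bytes is a standards and compiler question, while memset()
unconditionally clears every byte of the object, padding included.)

#include <string.h>

struct reg {                        /* hypothetical; has trailing padding */
	unsigned long long addr;    /* 8 bytes */
	unsigned int len;           /* 4 bytes + 4 padding bytes on LP64 */
};

void example(void)
{
	struct reg a = { 0 };       /* members zeroed; the 4 padding bytes
	                             * may or may not be, depending on the
	                             * compiler and the language standard */
	struct reg b;

	memset(&b, 0, sizeof(b));   /* all 16 bytes zeroed, padding included */
	b.addr = 1;
	b.len = 2;
	(void)a;
}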



[PATCH bpf] libbpf: fix passing uninitialized bytes to setsockopt

2019-10-09 Thread Ilya Maximets
'struct xdp_umem_reg' has 4 bytes of padding at the end that makes
valgrind complain about passing uninitialized stack memory to the
syscall:

  Syscall param socketcall.setsockopt() points to uninitialised byte(s)
at 0x4E7AB7E: setsockopt (in /usr/lib64/libc-2.29.so)
by 0x4BDE035: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:172)
  Uninitialised value was created by a stack allocation
at 0x4BDDEBA: xsk_umem__create@@LIBBPF_0.0.4 (xsk.c:140)

Padding bytes appeared after the introduction of the new 'flags' field.

Fixes: 10d30e301732 ("libbpf: add flags to umem config")
Signed-off-by: Ilya Maximets 
---
 tools/lib/bpf/xsk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index a902838f9fcc..26d9db783560 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -139,7 +139,7 @@ int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area,
const struct xsk_umem_config *usr_config)
 {
struct xdp_mmap_offsets off;
-   struct xdp_umem_reg mr;
+   struct xdp_umem_reg mr = {};
struct xsk_umem *umem;
socklen_t optlen;
void *map;
-- 
2.17.1



Re: [PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp

2019-08-27 Thread Ilya Maximets
On 26.08.2019 16:40, Maciej Fijalkowski wrote:
> On Thu, 22 Aug 2019 20:12:37 +0300
> Ilya Maximets  wrote:
> 
>> Tx code doesn't clear the descriptors' status after cleaning.
>> So, if the budget is larger than the number of used elems in a ring, some
>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>> prod_tail far beyond the prod_head breaking the completion queue ring.
>>
>> Fix that by limiting the number of descriptors to clean by the number
>> of used descriptors in the tx ring.
>>
>> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
>> 'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
>> 'next_to_clean' and 'next_to_use' indexes.
>>
>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>> Signed-off-by: Ilya Maximets 
>> ---
>>
>> Version 3:
>>   * Reverted some refactoring made for v2.
>>   * Eliminated 'budget' for tx clean.
>>   * prefetch returned.
>>
>> Version 2:
>>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>> 'ixgbe_xsk_clean_tx_ring()'.
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 
>>  1 file changed, 11 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> index 6b609553329f..a3b6d8c89127 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> @@ -633,19 +633,17 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
>>  bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>>  struct ixgbe_ring *tx_ring, int napi_budget)
> 
> While you're at it, can you please also remove the 'napi_budget' argument?
> It wasn't used at all even before your patch.

As you mentioned, this is not related to the current patch, and this patch
doesn't touch these particular lines of code.  So, I think it's better to
make a separate patch for this if you think it's needed.

> 
> I'm jumping in late, and I hesitated to take part in the discussion
> since v1 of this patch - but can you elaborate on why simply clearing
> the DD bit wasn't sufficient?

Clearing the DD bit would end up with the driver and hardware writing to
nearby memory locations at the same time, leading to cache thrashing and
poor performance.

Anyway, the additional write is unnecessary, because we know exactly which
descriptors we need to check.

Best regards, Ilya Maximets.

> 
> Maciej
> 
>>  {
>> +u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
>>  unsigned int total_packets = 0, total_bytes = 0;
>> -u32 i = tx_ring->next_to_clean, xsk_frames = 0;
>> -unsigned int budget = q_vector->tx.work_limit;
>>  struct xdp_umem *umem = tx_ring->xsk_umem;
>>  union ixgbe_adv_tx_desc *tx_desc;
>>  struct ixgbe_tx_buffer *tx_bi;
>> -bool xmit_done;
>> +u32 xsk_frames = 0;
>>  
>> -tx_bi = &tx_ring->tx_buffer_info[i];
>> -tx_desc = IXGBE_TX_DESC(tx_ring, i);
>> -i -= tx_ring->count;
>> +tx_bi = &tx_ring->tx_buffer_info[ntc];
>> +tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
>>  
>> -do {
>> +while (ntc != ntu) {
>>  if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
>>  break;
>>  
>> @@ -661,22 +659,18 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>>  
>>  tx_bi++;
>>  tx_desc++;
>> -i++;
>> -if (unlikely(!i)) {
>> -i -= tx_ring->count;
>> +ntc++;
>> +if (unlikely(ntc == tx_ring->count)) {
>> +ntc = 0;
>>  tx_bi = tx_ring->tx_buffer_info;
>>  tx_desc = IXGBE_TX_DESC(tx_ring, 0);
>>  }
>>  
>>  /* issue prefetch for next Tx descriptor */
>>  prefetch(tx_desc);
>> +}
>>  
>> -/* update budget accounting */
>> -budget--;
>> -} while (likely(budget));
>> -
>> -i += tx_ring->count;
>> -tx_ring->next_to_clean = i;
>> +tx_ring->next_to_clean = ntc;
>>  
>>  u64_stats_update_begin(&tx_ring->syncp);
>>  tx_ring->stats.bytes += total_bytes;
>> @@ -688,8 +682,7 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>>  if (xsk_frames)
>>  xsk_umem_complete_tx(umem, xsk_frames);
>>  
>> -xmit_done = ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit);
>> -return budget > 0 && xmit_done;
>> +return ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit);
>>  }
>>  
>>  int ixgbe_xsk_async_xmit(struct net_device *dev, u32 qid)
> 
> 
> 
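
(A sketch of the interaction described above, with a hypothetical layout: the
status word the driver would clear lives in the same descriptor, and typically
the same cache line, that the hardware writes back on completion.)

/*
 *   HW: writes tx_desc->wb.status (sets DD) when a descriptor completes
 *   SW: would write tx_desc->wb.status (clears DD) while cleaning
 *
 * Device and CPU writing the same cache lines concurrently forces the
 * lines to bounce between them ("cache thrashing").  Bounding the clean
 * loop by next_to_use - next_to_clean needs no software write at all.
 */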


[PATCH net v3] ixgbe: fix double clean of tx descriptors with xdp

2019-08-22 Thread Ilya Maximets
Tx code doesn't clear the descriptors' status after cleaning.
So, if the budget is larger than the number of used elems in a ring, some
descriptors will be accounted twice and xsk_umem_complete_tx will move
prod_tail far beyond the prod_head breaking the completion queue ring.

Fix that by limiting the number of descriptors to clean by the number
of used descriptors in the tx ring.

'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use the
'next_to_clean' and 'next_to_use' indexes.

Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
Signed-off-by: Ilya Maximets 
---

Version 3:
  * Reverted some refactoring made for v2.
  * Eliminated 'budget' for tx clean.
  * prefetch returned.

Version 2:
  * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
'ixgbe_xsk_clean_tx_ring()'.

 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 29 
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 6b609553329f..a3b6d8c89127 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -633,19 +633,17 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
 bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
struct ixgbe_ring *tx_ring, int napi_budget)
 {
+   u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
unsigned int total_packets = 0, total_bytes = 0;
-   u32 i = tx_ring->next_to_clean, xsk_frames = 0;
-   unsigned int budget = q_vector->tx.work_limit;
struct xdp_umem *umem = tx_ring->xsk_umem;
union ixgbe_adv_tx_desc *tx_desc;
struct ixgbe_tx_buffer *tx_bi;
-   bool xmit_done;
+   u32 xsk_frames = 0;
 
-   tx_bi = &tx_ring->tx_buffer_info[i];
-   tx_desc = IXGBE_TX_DESC(tx_ring, i);
-   i -= tx_ring->count;
+   tx_bi = &tx_ring->tx_buffer_info[ntc];
+   tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
 
-   do {
+   while (ntc != ntu) {
if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
break;
 
@@ -661,22 +659,18 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
 
tx_bi++;
tx_desc++;
-   i++;
-   if (unlikely(!i)) {
-   i -= tx_ring->count;
+   ntc++;
+   if (unlikely(ntc == tx_ring->count)) {
+   ntc = 0;
tx_bi = tx_ring->tx_buffer_info;
tx_desc = IXGBE_TX_DESC(tx_ring, 0);
}
 
/* issue prefetch for next Tx descriptor */
prefetch(tx_desc);
+   }
 
-   /* update budget accounting */
-   budget--;
-   } while (likely(budget));
-
-   i += tx_ring->count;
-   tx_ring->next_to_clean = i;
+   tx_ring->next_to_clean = ntc;
 
	u64_stats_update_begin(&tx_ring->syncp);
tx_ring->stats.bytes += total_bytes;
@@ -688,8 +682,7 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
if (xsk_frames)
xsk_umem_complete_tx(umem, xsk_frames);
 
-   xmit_done = ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit);
-   return budget > 0 && xmit_done;
+   return ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit);
 }
 
 int ixgbe_xsk_async_xmit(struct net_device *dev, u32 qid)
-- 
2.17.1
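
(For illustration, a simplified model of the bug, not driver code: the old
loop was bounded only by a budget and relied on the DD bit, which is never
cleared, so a second pass could count completed descriptors again. Bounding
the walk by the number of used descriptors, as the patch does, makes a second
pass over the same entries impossible.)

#include <stdbool.h>

/* Hypothetical model of the Tx clean loop after the fix. */
static unsigned int clean(unsigned int ntc, unsigned int ntu,
                          const bool *dd_bit, unsigned int count)
{
	unsigned int frames = 0;

	while (ntc != ntu) {        /* bounded by used descriptors */
		if (!dd_bit[ntc])
			break;      /* hardware hasn't finished this one */
		frames++;           /* each descriptor counted exactly once */
		if (++ntc == count)
			ntc = 0;    /* ring wrap-around */
	}
	return frames;              /* caller sets next_to_clean = ntc */
}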



Re: [PATCH net v2] ixgbe: fix double clean of tx descriptors with xdp

2019-08-22 Thread Ilya Maximets
On 22.08.2019 19:38, Alexander Duyck wrote:
> On Thu, Aug 22, 2019 at 5:30 AM Ilya Maximets  wrote:
>>
>> Tx code doesn't clear the descriptors' status after cleaning.
>> So, if the budget is larger than the number of used elems in a ring, some
>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>> prod_tail far beyond the prod_head breaking the completion queue ring.
>>
>> Fix that by limiting the number of descriptors to clean by the number
>> of used descriptors in the tx ring.
>>
>> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
>> 'ixgbe_xsk_clean_tx_ring()' since we don't need most of the
>> complications implemented in the regular 'ixgbe_clean_tx_irq()'
>> and we're allowed to directly use 'next_to_clean' and 'next_to_use'
>> indexes.
>>
>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>> Signed-off-by: Ilya Maximets 
>> ---
>>
>> Version 2:
>>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>> 'ixgbe_xsk_clean_tx_ring()'.
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 34 
>>  1 file changed, 13 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> index 6b609553329f..d1297660e14a 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> @@ -633,22 +633,23 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
>>  bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>> struct ixgbe_ring *tx_ring, int napi_budget)
>>  {
>> +   u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
>> unsigned int total_packets = 0, total_bytes = 0;
>> -   u32 i = tx_ring->next_to_clean, xsk_frames = 0;
>> unsigned int budget = q_vector->tx.work_limit;
>> struct xdp_umem *umem = tx_ring->xsk_umem;
>> -   union ixgbe_adv_tx_desc *tx_desc;
>> -   struct ixgbe_tx_buffer *tx_bi;
>> +   u32 xsk_frames = 0;
>> bool xmit_done;
>>
>> -   tx_bi = &tx_ring->tx_buffer_info[i];
>> -   tx_desc = IXGBE_TX_DESC(tx_ring, i);
>> -   i -= tx_ring->count;
>> +   while (likely(ntc != ntu && budget)) {
> 
> I would say you can get rid of budget entirely. It was only really
> needed for the regular Tx case where you can have multiple CPUs
> feeding a single Tx queue and causing a stall. Since we have a 1:1
> mapping we should never have more than the Rx budget worth of packets
> to really process. In addition we can only make one pass through the
> ring since the ntu value is not updated while running the loop.

OK. Will remove.

> 
>> +   union ixgbe_adv_tx_desc *tx_desc;
>> +   struct ixgbe_tx_buffer *tx_bi;
>> +
>> +   tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
>>
>> -   do {
>> if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
>> break;
>>
>> +   tx_bi = &tx_ring->tx_buffer_info[ntc];
> 
> Please don't move this logic into the loop. We were intentionally
> processing this outside of the loop once and then just doing the
> increments because it is faster that way. It takes several operations
> to compute tx_bi based on ntc, whereas just incrementing is a single
> operation.

OK.

> 
>> total_bytes += tx_bi->bytecount;
>> total_packets += tx_bi->gso_segs;
>>
>> @@ -659,24 +660,15 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>>
>> tx_bi->xdpf = NULL;
>>
>> -   tx_bi++;
>> -   tx_desc++;
>> -   i++;
>> -   if (unlikely(!i)) {
>> -   i -= tx_ring->count;
> 
> So these two lines can probably just be replaced by:
> if (unlikely(ntc == tx_ring->count)) {
> ntc = 0;

Sure.

> 
>> -   tx_bi = tx_ring->tx_buffer_info;
>> -   tx_desc = IXGBE_TX_DESC(tx_ring, 0);
>> -   }
>> -
>> -   /* issue prefetch for next Tx descriptor */
>> -   prefetch(tx_desc);
> 
> Did you just drop the prefetch?

I'll keep the prefetch in v3 because, as you fairly mentioned, it's not
related to this patch. However, I'm not sure if this prefetch makes any
sense here, because there is only on

Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp

2019-08-22 Thread Ilya Maximets
On 22.08.2019 19:07, William Tu wrote:
> On Thu, Aug 22, 2019 at 1:17 AM Ilya Maximets  wrote:
>>
>> On 22.08.2019 0:38, William Tu wrote:
>>> On Wed, Aug 21, 2019 at 9:57 AM Alexander Duyck
>>>  wrote:
>>>>
>>>> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets  
>>>> wrote:
>>>>>
>>>>> On 21.08.2019 4:17, Alexander Duyck wrote:
>>>>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets  
>>>>>> wrote:
>>>>>>>
>>>>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
>>>>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets  
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Tx code doesn't clear the descriptor status after cleaning.
>>>>>>>>> So, if the budget is larger than the number of used elems in a ring, some
>>>>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>>>>>>>>> prod_tail far beyond the prod_head breaking the completion queue ring.
>>>>>>>>>
>>>>>>>>> Fix that by limiting the number of descriptors to clean by the number
>>>>>>>>> of used descriptors in the tx ring.
>>>>>>>>>
>>>>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>>>>>>>>> Signed-off-by: Ilya Maximets 
>>>>>>>>
>>>>>>>> I'm not sure this is the best way to go. My preference would be to
>>>>>>>> have something in the ring that would prevent us from racing which I
>>>>>>>> don't think this really addresses. I am pretty sure this code is safe
>>>>>>>> on x86 but I would be worried about weak ordered systems such as
>>>>>>>> PowerPC.
>>>>>>>>
>>>>>>>> It might make sense to look at adding the eop_desc logic like we have
>>>>>>>> in the regular path with a proper barrier before we write it and after
>>>>>>>> we read it. So for example we could hold of on writing the bytecount
>>>>>>>> value until the end of an iteration and call smp_wmb before we write
>>>>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
>>>>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
>>>>>>>> clearing the value. Otherwise this code is going to just keep popping
>>>>>>>> up with issues.
>>>>>>>
>>>>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
>>>>>>> tx ring always happens in the same NAPI context and even on the same
>>>>>>> CPU core.
>>>>>>>
>>>>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
>>>>>>> use 'next_to_watch' field just as a flag of descriptor existence,
>>>>>>> but it seems unnecessarily complicated. Am I missing something?
>>>>>>>
>>>>>>
>>>>>> So is it always in the same NAPI context?. I forgot, I was thinking
>>>>>> that somehow the socket could possibly make use of XDP for transmit.
>>>>>
>>>>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
>>>>> is used in zero-copy mode. Real xmit happens inside
>>>>> ixgbe_poll()
>>>>>  -> ixgbe_clean_xdp_tx_irq()
>>>>> -> ixgbe_xmit_zc()
>>>>>
>>>>> It should not be possible to bind another XDP socket to the same netdev
>>>>> queue.
>>>>>
>>>>> It is also possible to xmit frames in xdp_ring while performing 
>>>>> XDP_TX/REDIRECT
>>>>> actions. REDIRECT could happen from different netdev with different NAPI
>>>>> context, but this operation is bound to specific CPU core and each core 
>>>>> has
>>>>> its own xdp_ring.
>>>>>
>>>>> However, I'm not an expert here.
>>>>> Björn, maybe you could comment on this?
>>>>>
>>>>>>
>>>>>> As far as the logic to use I would be good with just using a value you
>>>>>> are already setting such as the bytecount value. All that woul

[PATCH net v2] ixgbe: fix double clean of tx descriptors with xdp

2019-08-22 Thread Ilya Maximets
Tx code doesn't clear the descriptors' status after cleaning.
So, if the budget is larger than the number of used elems in a ring, some
descriptors will be accounted twice and xsk_umem_complete_tx will move
prod_tail far beyond the prod_head breaking the completion queue ring.

Fix that by limiting the number of descriptors to clean by the number
of used descriptors in the tx ring.

'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
'ixgbe_xsk_clean_tx_ring()' since we don't need most of the
complications implemented in the regular 'ixgbe_clean_tx_irq()'
and we're allowed to directly use the 'next_to_clean' and 'next_to_use'
indexes.

Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
Signed-off-by: Ilya Maximets 
---

Version 2:
  * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
'ixgbe_xsk_clean_tx_ring()'.

 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 34 
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 6b609553329f..d1297660e14a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -633,22 +633,23 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
 bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
struct ixgbe_ring *tx_ring, int napi_budget)
 {
+   u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
unsigned int total_packets = 0, total_bytes = 0;
-   u32 i = tx_ring->next_to_clean, xsk_frames = 0;
unsigned int budget = q_vector->tx.work_limit;
struct xdp_umem *umem = tx_ring->xsk_umem;
-   union ixgbe_adv_tx_desc *tx_desc;
-   struct ixgbe_tx_buffer *tx_bi;
+   u32 xsk_frames = 0;
bool xmit_done;
 
-   tx_bi = &tx_ring->tx_buffer_info[i];
-   tx_desc = IXGBE_TX_DESC(tx_ring, i);
-   i -= tx_ring->count;
+   while (likely(ntc != ntu && budget)) {
+   union ixgbe_adv_tx_desc *tx_desc;
+   struct ixgbe_tx_buffer *tx_bi;
+
+   tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
 
-   do {
if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
break;
 
+   tx_bi = &tx_ring->tx_buffer_info[ntc];
total_bytes += tx_bi->bytecount;
total_packets += tx_bi->gso_segs;
 
@@ -659,24 +660,15 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
 
tx_bi->xdpf = NULL;
 
-   tx_bi++;
-   tx_desc++;
-   i++;
-   if (unlikely(!i)) {
-   i -= tx_ring->count;
-   tx_bi = tx_ring->tx_buffer_info;
-   tx_desc = IXGBE_TX_DESC(tx_ring, 0);
-   }
-
-   /* issue prefetch for next Tx descriptor */
-   prefetch(tx_desc);
+   ntc++;
+   if (unlikely(ntc == tx_ring->count))
+   ntc = 0;
 
/* update budget accounting */
budget--;
-   } while (likely(budget));
+   }
 
-   i += tx_ring->count;
-   tx_ring->next_to_clean = i;
+   tx_ring->next_to_clean = ntc;
 
	u64_stats_update_begin(&tx_ring->syncp);
tx_ring->stats.bytes += total_bytes;
-- 
2.17.1



Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp

2019-08-22 Thread Ilya Maximets
On 22.08.2019 0:38, William Tu wrote:
> On Wed, Aug 21, 2019 at 9:57 AM Alexander Duyck
>  wrote:
>>
>> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets  wrote:
>>>
>>> On 21.08.2019 4:17, Alexander Duyck wrote:
>>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets  
>>>> wrote:
>>>>>
>>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
>>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets  
>>>>>> wrote:
>>>>>>>
>>>>>>> Tx code doesn't clear the descriptor status after cleaning.
>>>>>>> So, if the budget is larger than the number of used elems in a ring, some
>>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>>>>>>> prod_tail far beyond the prod_head breaking the completion queue ring.
>>>>>>>
>>>>>>> Fix that by limiting the number of descriptors to clean by the number
>>>>>>> of used descriptors in the tx ring.
>>>>>>>
>>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>>>>>>> Signed-off-by: Ilya Maximets 
>>>>>>
>>>>>> I'm not sure this is the best way to go. My preference would be to
>>>>>> have something in the ring that would prevent us from racing which I
>>>>>> don't think this really addresses. I am pretty sure this code is safe
>>>>>> on x86 but I would be worried about weak ordered systems such as
>>>>>> PowerPC.
>>>>>>
>>>>>> It might make sense to look at adding the eop_desc logic like we have
>>>>>> in the regular path with a proper barrier before we write it and after
>>>>>> we read it. So for example we could hold of on writing the bytecount
>>>>>> value until the end of an iteration and call smp_wmb before we write
>>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
>>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
>>>>>> clearing the value. Otherwise this code is going to just keep popping
>>>>>> up with issues.
>>>>>
>>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
>>>>> tx ring always happens in the same NAPI context and even on the same
>>>>> CPU core.
>>>>>
>>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
>>>>> use 'next_to_watch' field just as a flag of descriptor existence,
>>>>> but it seems unnecessarily complicated. Am I missing something?
>>>>>
>>>>
>>>> So is it always in the same NAPI context?. I forgot, I was thinking
>>>> that somehow the socket could possibly make use of XDP for transmit.
>>>
>>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
>>> is used in zero-copy mode. Real xmit happens inside
>>> ixgbe_poll()
>>>  -> ixgbe_clean_xdp_tx_irq()
>>> -> ixgbe_xmit_zc()
>>>
>>> It should not be possible to bind another XDP socket to the same netdev
>>> queue.
>>>
>>> It is also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
>>> actions. REDIRECT could happen from different netdev with different NAPI
>>> context, but this operation is bound to specific CPU core and each core has
>>> its own xdp_ring.
>>>
>>> However, I'm not an expert here.
>>> Björn, maybe you could comment on this?
>>>
>>>>
>>>> As far as the logic to use I would be good with just using a value you
>>>> are already setting such as the bytecount value. All that would need
>>>> to happen is to guarantee that the value is cleared in the Tx path. So
>>>> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
>>>> theoretically just use that as well to flag that a descriptor has been
>>>> populated and is ready to be cleaned. Assuming the logic about this
>>>> all being in the same NAPI context anyway you wouldn't need to mess
>>>> with the barrier stuff I mentioned before.
>>>
>>> Checking the number of used descs, i.e. next_to_use - next_to_clean,
>>> makes iteration in this function logically equal to the iteration inside
>>> 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the latter
>>> f

Re: [Intel-wired-lan] [PATCH net] ixgbe: fix double clean of tx descriptors with xdp

2019-08-22 Thread Ilya Maximets
On 22.08.2019 10:12, Björn Töpel wrote:
> On Wed, 21 Aug 2019 at 18:57, Alexander Duyck  
> wrote:
>>
>> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets  wrote:
>>>
>>> On 21.08.2019 4:17, Alexander Duyck wrote:
>>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets  
>>>> wrote:
>>>>>
>>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
>>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets  
>>>>>> wrote:
>>>>>>>
>>>>>>> Tx code doesn't clear the descriptor status after cleaning.
>>>>>> So, if the budget is larger than the number of used elems in a ring, some
>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>>>>>> prod_tail far beyond the prod_head breaking the completion queue ring.
>>>>>>>
>>>>>>> Fix that by limiting the number of descriptors to clean by the number
>>>>>>> of used descriptors in the tx ring.
>>>>>>>
>>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>>>>>>> Signed-off-by: Ilya Maximets 
>>>>>>
>>>>>> I'm not sure this is the best way to go. My preference would be to
>>>>>> have something in the ring that would prevent us from racing which I
>>>>>> don't think this really addresses. I am pretty sure this code is safe
>>>>>> on x86 but I would be worried about weak ordered systems such as
>>>>>> PowerPC.
>>>>>>
>>>>>> It might make sense to look at adding the eop_desc logic like we have
>>>>>> in the regular path with a proper barrier before we write it and after
>>>>>> we read it. So for example we could hold of on writing the bytecount
>>>>>> value until the end of an iteration and call smp_wmb before we write
>>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
>>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
>>>>>> clearing the value. Otherwise this code is going to just keep popping
>>>>>> up with issues.
>>>>>
>>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
>>>>> tx ring always happens in the same NAPI context and even on the same
>>>>> CPU core.
>>>>>
>>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
>>>>> use 'next_to_watch' field just as a flag of descriptor existence,
>>>>> but it seems unnecessarily complicated. Am I missing something?
>>>>>
>>>>
>>>> So is it always in the same NAPI context?. I forgot, I was thinking
>>>> that somehow the socket could possibly make use of XDP for transmit.
>>>
>>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
>>> is used in zero-copy mode. Real xmit happens inside
>>> ixgbe_poll()
>>>  -> ixgbe_clean_xdp_tx_irq()
>>> -> ixgbe_xmit_zc()
>>>
>>> It should not be possible to bind another XDP socket to the same netdev
>>> queue.
>>>
>>> It is also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
>>> actions. REDIRECT could happen from different netdev with different NAPI
>>> context, but this operation is bound to specific CPU core and each core has
>>> its own xdp_ring.
>>>
>>> However, I'm not an expert here.
>>> Björn, maybe you could comment on this?
>>>
>>>>
>>>> As far as the logic to use I would be good with just using a value you
>>>> are already setting such as the bytecount value. All that would need
>>>> to happen is to guarantee that the value is cleared in the Tx path. So
>>>> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
>>>> theoretically just use that as well to flag that a descriptor has been
>>>> populated and is ready to be cleaned. Assuming the logic about this
>>>> all being in the same NAPI context anyway you wouldn't need to mess
>>>> with the barrier stuff I mentioned before.
>>>
>>> Checking the number of used descs, i.e. next_to_use - next_to_clean,
>>> makes iteration in this function logically equal to the iteration inside
>>> 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the latter
>>> 

Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp

2019-08-21 Thread Ilya Maximets
On 21.08.2019 4:17, Alexander Duyck wrote:
> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets  wrote:
>>
>> On 20.08.2019 18:35, Alexander Duyck wrote:
>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets  
>>> wrote:
>>>>
>>>> Tx code doesn't clear the descriptor status after cleaning.
>>>> So, if the budget is larger than the number of used elems in a ring, some
>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>>>> prod_tail far beyond the prod_head breaking the completion queue ring.
>>>>
>>>> Fix that by limiting the number of descriptors to clean by the number
>>>> of used descriptors in the tx ring.
>>>>
>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>>>> Signed-off-by: Ilya Maximets 
>>>
>>> I'm not sure this is the best way to go. My preference would be to
>>> have something in the ring that would prevent us from racing which I
>>> don't think this really addresses. I am pretty sure this code is safe
>>> on x86 but I would be worried about weak ordered systems such as
>>> PowerPC.
>>>
>>> It might make sense to look at adding the eop_desc logic like we have
>>> in the regular path with a proper barrier before we write it and after
>>> we read it. So for example we could hold of on writing the bytecount
>>> value until the end of an iteration and call smp_wmb before we write
>>> it. Then on the cleanup we could read it and if it is non-zero we take
>>> an smp_rmb before proceeding further to process the Tx descriptor and
>>> clearing the value. Otherwise this code is going to just keep popping
>>> up with issues.
>>
>> But, unlike regular case, xdp zero-copy xmit and clean for particular
>> tx ring always happens in the same NAPI context and even on the same
>> CPU core.
>>
>> I saw the 'eop_desc' manipulations in regular case and yes, we could
>> use 'next_to_watch' field just as a flag of descriptor existence,
>> but it seems unnecessarily complicated. Am I missing something?
>>
> 
> So is it always in the same NAPI context?. I forgot, I was thinking
> that somehow the socket could possibly make use of XDP for transmit.

AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
is used in zero-copy mode. Real xmit happens inside
ixgbe_poll()
 -> ixgbe_clean_xdp_tx_irq()
-> ixgbe_xmit_zc()

It should not be possible to bind another XDP socket to the same netdev
queue.

It is also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
actions. REDIRECT could happen from different netdev with different NAPI
context, but this operation is bound to specific CPU core and each core has
its own xdp_ring.

However, I'm not an expert here.
Björn, maybe you could comment on this?

> 
> As far as the logic to use I would be good with just using a value you
> are already setting such as the bytecount value. All that would need
> to happen is to guarantee that the value is cleared in the Tx path. So
> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
> theoretically just use that as well to flag that a descriptor has been
> populated and is ready to be cleaned. Assuming the logic about this
> all being in the same NAPI context anyway you wouldn't need to mess
> with the barrier stuff I mentioned before.

Checking the number of used descs, i.e. next_to_use - next_to_clean,
makes iteration in this function logically equal to the iteration inside
'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the latter
function too, to follow the same 'bytecount' approach? I don't like having
two different ways to determine the number of used descriptors in the same file.

Best regards, Ilya Maximets.


Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp

2019-08-20 Thread Ilya Maximets
On 20.08.2019 18:35, Alexander Duyck wrote:
> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets  wrote:
>>
>> Tx code doesn't clear the descriptor status after cleaning.
>> So, if the budget is larger than the number of used elems in a ring, some
>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>> prod_tail far beyond the prod_head breaking the completion queue ring.
>>
>> Fix that by limiting the number of descriptors to clean by the number
>> of used descriptors in the tx ring.
>>
>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>> Signed-off-by: Ilya Maximets 
> 
> I'm not sure this is the best way to go. My preference would be to
> have something in the ring that would prevent us from racing which I
> don't think this really addresses. I am pretty sure this code is safe
> on x86 but I would be worried about weak ordered systems such as
> PowerPC.
> 
> It might make sense to look at adding the eop_desc logic like we have
> in the regular path with a proper barrier before we write it and after
> we read it. So for example we could hold of on writing the bytecount
> value until the end of an iteration and call smp_wmb before we write
> it. Then on the cleanup we could read it and if it is non-zero we take
> an smp_rmb before proceeding further to process the Tx descriptor and
> clearing the value. Otherwise this code is going to just keep popping
> up with issues.

But, unlike the regular case, xdp zero-copy xmit and clean for a particular
tx ring always happens in the same NAPI context and even on the same
CPU core.

I saw the 'eop_desc' manipulations in the regular case and yes, we could
use 'next_to_watch' field just as a flag of descriptor existence,
but it seems unnecessarily complicated. Am I missing something?

> 
>> ---
>>
>> Not tested yet because of lack of available hardware.
>> So, testing is very welcome.
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h  | 10 ++
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 +---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  6 --
>>  3 files changed, 15 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> index 39e73ad60352..0befcef46e80 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> @@ -512,6 +512,16 @@ static inline u16 ixgbe_desc_unused(struct ixgbe_ring *ring)
>> return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
>>  }
>>
>> +static inline u64 ixgbe_desc_used(struct ixgbe_ring *ring)
>> +{
>> +   unsigned int head, tail;
>> +
>> +   head = ring->next_to_clean;
>> +   tail = ring->next_to_use;
>> +
>> +   return ((head <= tail) ? tail : tail + ring->count) - head;
>> +}
>> +
>>  #define IXGBE_RX_DESC(R, i)\
>> (&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
>>  #define IXGBE_TX_DESC(R, i)\
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 7882148abb43..d417237857d8 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -1012,21 +1012,11 @@ static u64 ixgbe_get_tx_completed(struct ixgbe_ring *ring)
>> return ring->stats.packets;
>>  }
>>
>> -static u64 ixgbe_get_tx_pending(struct ixgbe_ring *ring)
>> -{
>> -   unsigned int head, tail;
>> -
>> -   head = ring->next_to_clean;
>> -   tail = ring->next_to_use;
>> -
>> -   return ((head <= tail) ? tail : tail + ring->count) - head;
>> -}
>> -
>>  static inline bool ixgbe_check_tx_hang(struct ixgbe_ring *tx_ring)
>>  {
>> u32 tx_done = ixgbe_get_tx_completed(tx_ring);
>> u32 tx_done_old = tx_ring->tx_stats.tx_done_old;
>> -   u32 tx_pending = ixgbe_get_tx_pending(tx_ring);
>> +   u32 tx_pending = ixgbe_desc_used(tx_ring);
>>
>> clear_check_for_tx_hang(tx_ring);
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> index 6b609553329f..7702efed356a 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> @@ -637,6 +637,7 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>> u32 i = tx_ring->next_to_clean, xsk_frames = 0;

[PATCH net] ixgbe: fix double clean of tx descriptors with xdp

2019-08-20 Thread Ilya Maximets
Tx code doesn't clear the descriptor status after cleaning.
So, if the budget is larger than the number of used elems in a ring, some
descriptors will be accounted twice and xsk_umem_complete_tx will move
prod_tail far beyond the prod_head breaking the completion queue ring.

Fix that by limiting the number of descriptors to clean by the number
of used descriptors in the tx ring.

Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
Signed-off-by: Ilya Maximets 
---

Not tested yet because of lack of available hardware.
So, testing is very welcome.

 drivers/net/ethernet/intel/ixgbe/ixgbe.h  | 10 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  6 --
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 39e73ad60352..0befcef46e80 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -512,6 +512,16 @@ static inline u16 ixgbe_desc_unused(struct ixgbe_ring *ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
+static inline u64 ixgbe_desc_used(struct ixgbe_ring *ring)
+{
+   unsigned int head, tail;
+
+   head = ring->next_to_clean;
+   tail = ring->next_to_use;
+
+   return ((head <= tail) ? tail : tail + ring->count) - head;
+}
+
 #define IXGBE_RX_DESC(R, i)\
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBE_TX_DESC(R, i)\
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7882148abb43..d417237857d8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1012,21 +1012,11 @@ static u64 ixgbe_get_tx_completed(struct ixgbe_ring *ring)
return ring->stats.packets;
 }
 
-static u64 ixgbe_get_tx_pending(struct ixgbe_ring *ring)
-{
-   unsigned int head, tail;
-
-   head = ring->next_to_clean;
-   tail = ring->next_to_use;
-
-   return ((head <= tail) ? tail : tail + ring->count) - head;
-}
-
 static inline bool ixgbe_check_tx_hang(struct ixgbe_ring *tx_ring)
 {
u32 tx_done = ixgbe_get_tx_completed(tx_ring);
u32 tx_done_old = tx_ring->tx_stats.tx_done_old;
-   u32 tx_pending = ixgbe_get_tx_pending(tx_ring);
+   u32 tx_pending = ixgbe_desc_used(tx_ring);
 
clear_check_for_tx_hang(tx_ring);
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 6b609553329f..7702efed356a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -637,6 +637,7 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
u32 i = tx_ring->next_to_clean, xsk_frames = 0;
unsigned int budget = q_vector->tx.work_limit;
struct xdp_umem *umem = tx_ring->xsk_umem;
+   u32 used_descs = ixgbe_desc_used(tx_ring);
union ixgbe_adv_tx_desc *tx_desc;
struct ixgbe_tx_buffer *tx_bi;
bool xmit_done;
@@ -645,7 +646,7 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
tx_desc = IXGBE_TX_DESC(tx_ring, i);
i -= tx_ring->count;
 
-   do {
+   while (likely(budget && used_descs)) {
if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
break;
 
@@ -673,7 +674,8 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
 
/* update budget accounting */
budget--;
-   } while (likely(budget));
+   used_descs--;
+   }
 
i += tx_ring->count;
tx_ring->next_to_clean = i;
-- 
2.17.1
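
(A worked example of the 'ixgbe_desc_used()' arithmetic above, with a
hypothetical ring size of 8: the conditional handles the case where
'next_to_use' has wrapped past the end of the ring while 'next_to_clean'
has not.)

#include <assert.h>

static unsigned int desc_used(unsigned int head, unsigned int tail,
                              unsigned int count)
{
	return ((head <= tail) ? tail : tail + count) - head;
}

int main(void)
{
	assert(desc_used(2, 6, 8) == 4);  /* no wrap: slots 2,3,4,5 in use */
	assert(desc_used(6, 2, 8) == 4);  /* wrapped: slots 6,7,0,1 in use */
	return 0;
}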



[PATCH bpf] libbpf: fix using uninitialized ioctl results

2019-07-23 Thread Ilya Maximets
'channels.max_combined' is initialized only on ioctl() success, and
errno is only valid on ioctl() failure.

The code doesn't produce any runtime issues, but makes memory
sanitizers angry:

 Conditional jump or move depends on uninitialised value(s)
at 0x55C056F: xsk_get_max_queues (xsk.c:336)
by 0x55C05B2: xsk_create_bpf_maps (xsk.c:354)
by 0x55C089F: xsk_setup_xdp_prog (xsk.c:447)
by 0x55C0E57: xsk_socket__create (xsk.c:601)
  Uninitialised value was created by a stack allocation
at 0x55C04CD: xsk_get_max_queues (xsk.c:318)

Additionally, fixed a warning about uninitialized bytes in ioctl arguments:

 Syscall param ioctl(SIOCETHTOOL) points to uninitialised byte(s)
at 0x648D45B: ioctl (in /usr/lib64/libc-2.28.so)
by 0x55C0546: xsk_get_max_queues (xsk.c:330)
by 0x55C05B2: xsk_create_bpf_maps (xsk.c:354)
by 0x55C089F: xsk_setup_xdp_prog (xsk.c:447)
by 0x55C0E57: xsk_socket__create (xsk.c:601)
  Address 0x1ffefff378 is on thread 1's stack
  in frame #1, created by xsk_get_max_queues (xsk.c:318)
  Uninitialised value was created by a stack allocation
at 0x55C04CD: xsk_get_max_queues (xsk.c:318)

CC: Magnus Karlsson 
Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
Signed-off-by: Ilya Maximets 
---
 tools/lib/bpf/xsk.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index 5007b5d4fd2c..c4f912dc30f9 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -317,7 +317,7 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
 
 static int xsk_get_max_queues(struct xsk_socket *xsk)
 {
-   struct ethtool_channels channels;
+   struct ethtool_channels channels = { .cmd = ETHTOOL_GCHANNELS };
struct ifreq ifr;
int fd, err, ret;
 
@@ -325,7 +325,7 @@ static int xsk_get_max_queues(struct xsk_socket *xsk)
if (fd < 0)
return -errno;
 
-   channels.cmd = ETHTOOL_GCHANNELS;
+   memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_data = (void *)&channels;
strncpy(ifr.ifr_name, xsk->ifname, IFNAMSIZ - 1);
ifr.ifr_name[IFNAMSIZ - 1] = '\0';
@@ -335,7 +335,7 @@ static int xsk_get_max_queues(struct xsk_socket *xsk)
goto out;
}
 
-   if (channels.max_combined == 0 || errno == EOPNOTSUPP)
+   if (err || channels.max_combined == 0)
/* If the device says it has no channels, then all traffic
 * is sent to a single stream, so max queues = 1.
 */
-- 
2.17.1
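
(For reference, a minimal user-space sketch of the fixed pattern, with error
handling trimmed and the function name made up: both structures are fully
initialized before the ioctl(), and the result is consulted only when the
ioctl() actually succeeded.)

#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int get_max_queues(const char *ifname)
{
	struct ethtool_channels channels = { .cmd = ETHTOOL_GCHANNELS };
	struct ifreq ifr;
	int fd, err, ret;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));         /* no uninitialized bytes */
	ifr.ifr_data = (void *)&channels;
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	err = ioctl(fd, SIOCETHTOOL, &ifr);

	/* 'max_combined' is valid only if the ioctl() succeeded. */
	ret = (err || channels.max_combined == 0) ? 1 : channels.max_combined;

	close(fd);
	return ret;
}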



[PATCH bpf] xdp: fix potential deadlock on socket mutex

2019-07-08 Thread Ilya Maximets
There are 2 call chains:

  a) xsk_bind --> xdp_umem_assign_dev
  b) unregister_netdevice_queue --> xsk_notifier

with the following locking order:

  a) xs->mutex --> rtnl_lock
  b) rtnl_lock --> xdp.lock --> xs->mutex

Different order of taking 'xs->mutex' and 'rtnl_lock' could produce a
deadlock here. Fix that by moving the 'rtnl_lock' before 'xs->mutex' in
the bind call chain (a).

Reported-by: syzbot+bf64ec93de836d7f4...@syzkaller.appspotmail.com
Fixes: 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket")
Signed-off-by: Ilya Maximets 
---

This patch is a fix for a patch that is not yet in mainline, but is
already in the 'net' tree.  I'm not sure what the correct process is
for applying such fixes.

 net/xdp/xdp_umem.c | 16 ++--
 net/xdp/xsk.c  |  2 ++
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 20c91f02d3d8..83de74ca729a 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -87,21 +87,20 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
struct netdev_bpf bpf;
int err = 0;
 
+   ASSERT_RTNL();
+
force_zc = flags & XDP_ZEROCOPY;
force_copy = flags & XDP_COPY;
 
if (force_zc && force_copy)
return -EINVAL;
 
-   rtnl_lock();
-   if (xdp_get_umem_from_qid(dev, queue_id)) {
-   err = -EBUSY;
-   goto out_rtnl_unlock;
-   }
+   if (xdp_get_umem_from_qid(dev, queue_id))
+   return -EBUSY;
 
err = xdp_reg_umem_at_qid(dev, umem, queue_id);
if (err)
-   goto out_rtnl_unlock;
+   return err;
 
umem->dev = dev;
umem->queue_id = queue_id;
@@ -110,7 +109,7 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 
if (force_copy)
/* For copy-mode, we are done. */
-   goto out_rtnl_unlock;
+   return 0;
 
if (!dev->netdev_ops->ndo_bpf ||
!dev->netdev_ops->ndo_xsk_async_xmit) {
@@ -125,7 +124,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
	err = dev->netdev_ops->ndo_bpf(dev, &bpf);
if (err)
goto err_unreg_umem;
-   rtnl_unlock();
 
umem->zc = true;
return 0;
@@ -135,8 +133,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
err = 0; /* fallback to copy mode */
if (err)
xdp_clear_umem_at_qid(dev, queue_id);
-out_rtnl_unlock:
-   rtnl_unlock();
return err;
 }
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 703cf5ea448b..2aa6072a3e55 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -416,6 +416,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
if (flags & ~(XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY))
return -EINVAL;
 
+   rtnl_lock();
	mutex_lock(&xs->mutex);
if (xs->state != XSK_READY) {
err = -EBUSY;
@@ -501,6 +502,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
xs->state = XSK_BOUND;
 out_release:
	mutex_unlock(&xs->mutex);
+   rtnl_unlock();
return err;
 }
 
-- 
2.17.1
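
(For illustration, the classic ABBA pattern behind this report, sketched with
the two paths named above: each thread can hold one lock while waiting for the
other. The fix makes both paths agree on a single order.)

/*
 * Before the fix:
 *
 *   xsk_bind (path a)           xsk_notifier (path b)
 *   -----------------           ---------------------
 *   lock(xs->mutex)             lock(rtnl_lock)
 *   lock(rtnl_lock)   <- waits  lock(xs->mutex)   <- waits
 *
 * Each side holds the lock the other needs: deadlock.
 *
 * After the fix, path (a) takes rtnl_lock first, so both paths
 * use the same order: rtnl_lock, then xs->mutex.
 */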



[PATCH bpf] xdp: fix possible cq entry leak

2019-07-04 Thread Ilya Maximets
Completion queue address reservation cannot be undone.
In case of a bad 'queue_id' or an skb allocation failure, the reserved
entry will be leaked, reducing the total capacity of the completion queue.

Fix that by moving the reservation to the point where failure is not
possible.  Additionally, the 'queue_id' check is moved out of the loop,
since there is no point in checking it on every iteration.

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Ilya Maximets 
---
 net/xdp/xsk.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f53a6ef7c155..703cf5ea448b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -226,6 +226,9 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
mutex_lock(>mutex);
 
+   if (xs->queue_id >= xs->dev->real_num_tx_queues)
+   goto out;
+
	while (xskq_peek_desc(xs->tx, &desc)) {
char *buffer;
u64 addr;
@@ -236,12 +239,6 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
goto out;
}
 
-   if (xskq_reserve_addr(xs->umem->cq))
-   goto out;
-
-   if (xs->queue_id >= xs->dev->real_num_tx_queues)
-   goto out;
-
len = desc.len;
	skb = sock_alloc_send_skb(sk, len, 1, &err);
if (unlikely(!skb)) {
@@ -253,7 +250,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
addr = desc.addr;
buffer = xdp_umem_get_data(xs->umem, addr);
err = skb_store_bits(skb, 0, buffer, len);
-   if (unlikely(err)) {
+   if (unlikely(err) || xskq_reserve_addr(xs->umem->cq)) {
kfree_skb(skb);
goto out;
}
-- 
2.17.1
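
(The general shape of the fix, as a hedged pseudo-C sketch: when a side
effect, here the cq reservation, cannot be rolled back, it must come after
every step that can still fail; steps that can be undone, like the skb
allocation, may come earlier.)

/*
 * Before: the irreversible step happens first.
 *
 *	if (xskq_reserve_addr(cq))
 *		goto out;
 *	skb = sock_alloc_send_skb(...);
 *	if (!skb)
 *		goto out;           <- reserved cq entry is leaked
 *
 * After: all fallible steps first, reservation last.
 *
 *	skb = sock_alloc_send_skb(...);
 *	if (!skb)
 *		goto out;
 *	if (skb_store_bits(...) || xskq_reserve_addr(cq)) {
 *		kfree_skb(skb);     <- the skb, unlike the cq entry,
 *		goto out;              can be undone
 *	}
 */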



[PATCH bpf v2] xdp: fix race on generic receive path

2019-07-03 Thread Ilya Maximets
Unlike driver mode, generic xdp receive could be triggered
by different threads on different CPU cores at the same time,
leading to fill and rx queue breakage. For example, this
could happen while sending packets from two processes to the
first interface of a veth pair while the other end of it is
open with an AF_XDP socket.

Need to take a lock for each generic receive to avoid the race.

Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
Signed-off-by: Ilya Maximets 
---

Version 2:
* spin_lock_irqsave --> spin_lock_bh.

 include/net/xdp_sock.h |  2 ++
 net/xdp/xsk.c  | 31 ++-
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d074b6d60f8a..ac3c047d058c 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -67,6 +67,8 @@ struct xdp_sock {
 * in the SKB destructor callback.
 */
spinlock_t tx_completion_lock;
+   /* Protects generic receive. */
+   spinlock_t rx_lock;
u64 rx_dropped;
 };
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..5e0637db92ea 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -123,13 +123,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
u64 addr;
int err;
 
-   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
-   return -EINVAL;
+   spin_lock_bh(&xs->rx_lock);
+
+   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) {
+   err = -EINVAL;
+   goto out_unlock;
+   }
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
-   xs->rx_dropped++;
-   return -ENOSPC;
+   err = -ENOSPC;
+   goto out_drop;
}
 
addr += xs->umem->headroom;
@@ -138,13 +142,21 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
memcpy(buffer, xdp->data_meta, len + metalen);
addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
-   if (!err) {
-   xskq_discard_addr(xs->umem->fq);
-   xsk_flush(xs);
-   return 0;
-   }
+   if (err)
+   goto out_drop;
+
+   xskq_discard_addr(xs->umem->fq);
+   xskq_produce_flush_desc(xs->rx);
 
+   spin_unlock_bh(&xs->rx_lock);
+
+   xs->sk.sk_data_ready(&xs->sk);
+   return 0;
+
+out_drop:
xs->rx_dropped++;
+out_unlock:
+   spin_unlock_bh(&xs->rx_lock);
return err;
 }
 
@@ -765,6 +777,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 
xs = xdp_sk(sk);
mutex_init(&xs->mutex);
+   spin_lock_init(&xs->rx_lock);
spin_lock_init(&xs->tx_completion_lock);
 
mutex_lock(&net->xdp.lock);
-- 
2.17.1



Re: [PATCH bpf] xdp: fix race on generic receive path

2019-07-03 Thread Ilya Maximets
On 03.07.2019 3:40, Jakub Kicinski wrote:
> On Tue,  2 Jul 2019 17:36:34 +0300, Ilya Maximets wrote:
>> Unlike driver mode, generic xdp receive could be triggered
>> by different threads on different CPU cores at the same time,
>> leading to corruption of the fill and rx queues. For example,
>> this could happen while sending packets from two processes to
>> the first interface of a veth pair while the second end of it
>> is open with an AF_XDP socket.
>>
>> We need to take a lock for each generic receive to avoid the race.
>>
>> Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
>> Signed-off-by: Ilya Maximets 
> 
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index a14e8864e4fa..19f41d2b670c 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -119,17 +119,22 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>>  {
>>  u32 metalen = xdp->data - xdp->data_meta;
>>  u32 len = xdp->data_end - xdp->data;
>> +unsigned long flags;
>>  void *buffer;
>>  u64 addr;
>>  int err;
>>  
>> -if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
>> -return -EINVAL;
>> +spin_lock_irqsave(&xs->rx_lock, flags);
> 
> Why _irqsave, rather than _bh?

Yes, spin_lock_bh() is enough here. Will change in v2.
Thanks.
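
For context, a kernel-style sketch (added here for illustration, not
part of the original exchange) of the two locking variants:

/* Kernel-context sketch, not a standalone program. */
static void rcv_v1(struct xdp_sock *xs)
{
	unsigned long flags;

	/* Also disables hard IRQs and saves their state: stronger (and
	 * more expensive) than this path requires. */
	spin_lock_irqsave(&xs->rx_lock, flags);
	/* ... */
	spin_unlock_irqrestore(&xs->rx_lock, flags);
}

static void rcv_v2(struct xdp_sock *xs)
{
	/* Only disables softirqs on the local CPU, which is sufficient
	 * to exclude concurrent generic receive here. */
	spin_lock_bh(&xs->rx_lock);
	/* ... */
	spin_unlock_bh(&xs->rx_lock);
}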

> 
>> +if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) {
>> +err = -EINVAL;
>> +goto out_unlock;
>> +}
>>  
>> if (!xskq_peek_addr(xs->umem->fq, &addr) ||
>>  len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>> -xs->rx_dropped++;
>> -return -ENOSPC;
>> +err = -ENOSPC;
>> +goto out_drop;
>>  }
>>  
>>  addr += xs->umem->headroom;
> 
> 
> 


Re: [PATCH bpf] xdp: fix race on generic receive path

2019-07-02 Thread Ilya Maximets
On 02.07.2019 18:01, Magnus Karlsson wrote:
> On Tue, Jul 2, 2019 at 4:36 PM Ilya Maximets  wrote:
>>
>> Unlike driver mode, generic xdp receive could be triggered
>> by different threads on different CPU cores at the same time,
>> leading to corruption of the fill and rx queues. For example,
>> this could happen while sending packets from two processes to
>> the first interface of a veth pair while the second end of it
>> is open with an AF_XDP socket.
>>
>> We need to take a lock for each generic receive to avoid the race.
> 
> Thanks for this catch Ilya. Do you have any performance numbers you
> could share of the impact of adding this spin lock? The reason I ask
> is that if the impact is negligible, then let us just add it. But if
> it is too large, we might want to brain storm about some other
> possible solutions.

Hi. Unfortunately, I don't have hardware for performance tests right
now, so I could only run tests over virtual interfaces like a veth pair.
It would be good if someone could check the performance with real HW.

Best regards, Ilya Maximets.

> 
> Thanks: Magnus
> 
>> Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
>> Signed-off-by: Ilya Maximets 
>> ---
>>  include/net/xdp_sock.h |  2 ++
>>  net/xdp/xsk.c  | 32 +++-
>>  2 files changed, 25 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>> index d074b6d60f8a..ac3c047d058c 100644
>> --- a/include/net/xdp_sock.h
>> +++ b/include/net/xdp_sock.h
>> @@ -67,6 +67,8 @@ struct xdp_sock {
>>  * in the SKB destructor callback.
>>  */
>> spinlock_t tx_completion_lock;
>> +   /* Protects generic receive. */
>> +   spinlock_t rx_lock;
>> u64 rx_dropped;
>>  };
>>
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index a14e8864e4fa..19f41d2b670c 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -119,17 +119,22 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>>  {
>> u32 metalen = xdp->data - xdp->data_meta;
>> u32 len = xdp->data_end - xdp->data;
>> +   unsigned long flags;
>> void *buffer;
>> u64 addr;
>> int err;
>>
>> -   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
>> -   return -EINVAL;
>> +   spin_lock_irqsave(&xs->rx_lock, flags);
>> +
>> +   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) {
>> +   err = -EINVAL;
>> +   goto out_unlock;
>> +   }
>>
>> if (!xskq_peek_addr(xs->umem->fq, &addr) ||
>> len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>> -   xs->rx_dropped++;
>> -   return -ENOSPC;
>> +   err = -ENOSPC;
>> +   goto out_drop;
>> }
>>
>> addr += xs->umem->headroom;
>> @@ -138,13 +143,21 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>> memcpy(buffer, xdp->data_meta, len + metalen);
>> addr += metalen;
>> err = xskq_produce_batch_desc(xs->rx, addr, len);
>> -   if (!err) {
>> -   xskq_discard_addr(xs->umem->fq);
>> -   xsk_flush(xs);
>> -   return 0;
>> -   }
>> +   if (err)
>> +   goto out_drop;
>> +
>> +   xskq_discard_addr(xs->umem->fq);
>> +   xskq_produce_flush_desc(xs->rx);
>>
>> +   spin_unlock_irqrestore(&xs->rx_lock, flags);
>> +
>> +   xs->sk.sk_data_ready(&xs->sk);
>> +   return 0;
>> +
>> +out_drop:
>> xs->rx_dropped++;
>> +out_unlock:
>> +   spin_unlock_irqrestore(&xs->rx_lock, flags);
>> return err;
>>  }
>>
>> @@ -765,6 +778,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
>>
>> xs = xdp_sk(sk);
>> mutex_init(&xs->mutex);
>> +   spin_lock_init(&xs->rx_lock);
>> spin_lock_init(&xs->tx_completion_lock);
>>
>> mutex_lock(&net->xdp.lock);
>> --
>> 2.17.1


[PATCH bpf] xdp: fix race on generic receive path

2019-07-02 Thread Ilya Maximets
Unlike driver mode, generic xdp receive could be triggered
by different threads on different CPU cores at the same time,
leading to corruption of the fill and rx queues. For example,
this could happen while sending packets from two processes to
the first interface of a veth pair while the second end of it
is open with an AF_XDP socket.

We need to take a lock for each generic receive to avoid the race.

Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
Signed-off-by: Ilya Maximets 
---
 include/net/xdp_sock.h |  2 ++
 net/xdp/xsk.c  | 32 +++-
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d074b6d60f8a..ac3c047d058c 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -67,6 +67,8 @@ struct xdp_sock {
 * in the SKB destructor callback.
 */
spinlock_t tx_completion_lock;
+   /* Protects generic receive. */
+   spinlock_t rx_lock;
u64 rx_dropped;
 };
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..19f41d2b670c 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -119,17 +119,22 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
u32 metalen = xdp->data - xdp->data_meta;
u32 len = xdp->data_end - xdp->data;
+   unsigned long flags;
void *buffer;
u64 addr;
int err;
 
-   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
-   return -EINVAL;
+   spin_lock_irqsave(&xs->rx_lock, flags);
+
+   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) {
+   err = -EINVAL;
+   goto out_unlock;
+   }
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
-   xs->rx_dropped++;
-   return -ENOSPC;
+   err = -ENOSPC;
+   goto out_drop;
}
 
addr += xs->umem->headroom;
@@ -138,13 +143,21 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
memcpy(buffer, xdp->data_meta, len + metalen);
addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
-   if (!err) {
-   xskq_discard_addr(xs->umem->fq);
-   xsk_flush(xs);
-   return 0;
-   }
+   if (err)
+   goto out_drop;
+
+   xskq_discard_addr(xs->umem->fq);
+   xskq_produce_flush_desc(xs->rx);
 
+   spin_unlock_irqrestore(&xs->rx_lock, flags);
+
+   xs->sk.sk_data_ready(&xs->sk);
+   return 0;
+
+out_drop:
xs->rx_dropped++;
+out_unlock:
+   spin_unlock_irqrestore(&xs->rx_lock, flags);
return err;
 }
 
@@ -765,6 +778,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 
xs = xdp_sk(sk);
mutex_init(&xs->mutex);
+   spin_lock_init(&xs->rx_lock);
spin_lock_init(&xs->tx_completion_lock);
 
mutex_lock(&net->xdp.lock);
-- 
2.17.1



[PATCH bpf v6 2/2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-28 Thread Ilya Maximets
A device bound to an XDP socket will not have a zero refcount until
the userspace application closes it. This leads to a hang inside
'netdev_wait_allrefs()' if device unregistering is requested:

  # ip link del p1
  < hang on recvmsg on netlink socket >

  # ps -x | grep ip
  5126  pts/0D+   0:00 ip link del p1

  # journalctl -b

  Jun 05 07:19:16 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1

  Jun 05 07:19:27 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  ...

Fix that by implementing NETDEV_UNREGISTER event notification handler
to properly clean up all the resources and unref device.

This should also allow socket killing via ss(8) utility.
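
For illustration, a toy userspace model of the hang (hypothetical names,
C11 atomics; not the kernel implementation):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* Toy model: "unregister" spins until the refcount drops, but the
 * reference is only released when the application closes the socket. */
static atomic_int refcnt = 1;	/* held on behalf of the bound socket */

static void *unregister_netdev(void *arg)
{
	(void)arg;
	while (atomic_load(&refcnt) != 0) {	/* netdev_wait_allrefs() analogue */
		fprintf(stderr, "waiting for p1 to become free. Usage count = %d\n",
			atomic_load(&refcnt));
		sleep(1);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, unregister_netdev, NULL);
	sleep(3);			/* meanwhile, "ip link del p1" appears hung */
	atomic_fetch_sub(&refcnt, 1);	/* the fix: drop the reference from the
					 * NETDEV_UNREGISTER handler instead of
					 * waiting for the application to exit */
	pthread_join(t, NULL);
	puts("device freed");
	return 0;
}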

Fixes: 965a99098443 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets 
Acked-by: Jonathan Lemon 
---
 include/net/xdp_sock.h |  5 +++
 net/xdp/xdp_umem.c | 10 ++---
 net/xdp/xdp_umem.h |  1 +
 net/xdp/xsk.c  | 87 --
 4 files changed, 87 insertions(+), 16 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d074b6d60f8a..82d153a637c7 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -61,6 +61,11 @@ struct xdp_sock {
struct xsk_queue *tx cacheline_aligned_in_smp;
struct list_head list;
bool zc;
+   enum {
+   XSK_READY = 0,
+   XSK_BOUND,
+   XSK_UNBOUND,
+   } state;
/* Protects multiple processes in the control path */
struct mutex mutex;
/* Mutual exclusion of NAPI TX thread and sendmsg error paths
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 267b82a4cbcf..20c91f02d3d8 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -140,11 +140,13 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
return err;
 }
 
-static void xdp_umem_clear_dev(struct xdp_umem *umem)
+void xdp_umem_clear_dev(struct xdp_umem *umem)
 {
struct netdev_bpf bpf;
int err;
 
+   ASSERT_RTNL();
+
if (!umem->dev)
return;
 
@@ -153,17 +155,13 @@ static void xdp_umem_clear_dev(struct xdp_umem *umem)
bpf.xsk.umem = NULL;
bpf.xsk.queue_id = umem->queue_id;
 
-   rtnl_lock();
err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf);
-   rtnl_unlock();
 
if (err)
WARN(1, "failed to disable umem!\n");
}
 
-   rtnl_lock();
xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
-   rtnl_unlock();
 
dev_put(umem->dev);
umem->dev = NULL;
@@ -195,7 +193,9 @@ static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
 
 static void xdp_umem_release(struct xdp_umem *umem)
 {
+   rtnl_lock();
xdp_umem_clear_dev(umem);
+   rtnl_unlock();
 
ida_simple_remove(&umem_ida, umem->id);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 27603227601b..a63a9fb251f5 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -10,6 +10,7 @@
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
u16 queue_id, u16 flags);
+void xdp_umem_clear_dev(struct xdp_umem *umem);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..336723948a36 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -335,6 +335,22 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
return 0;
 }
 
+static void xsk_unbind_dev(struct xdp_sock *xs)
+{
+   struct net_device *dev = xs->dev;
+
+   if (!dev || xs->state != XSK_BOUND)
+   return;
+
+   xs->state = XSK_UNBOUND;
+
+   /* Wait for driver to stop using the xdp socket. */
+   xdp_del_sk_umem(xs->umem, xs);
+   xs->dev = NULL;
+   synchronize_net();
+   dev_put(dev);
+}
+
 static int xsk_release(struct socket *sock)
 {
struct sock *sk = sock->sk;
@@ -354,15 +370,7 @@ static int xsk_release(struct socket *sock)
sock_prot_inuse_add(net, sk->sk_prot, -1);
local_bh_enable();
 
-   if (xs->dev) {
-   struct net_device *dev = xs->dev;
-
-   /* Wait for driver to stop using the xdp socket. */
-   xdp_del_sk_umem(xs->umem, xs);
-   xs->dev = NULL;
-   synchronize_net();
-   dev_put(dev);
-   }
+   xsk_unbind_dev(xs);
 
xskq_destroy(xs->rx);
xskq_destroy(xs->tx);
@@ -412,7 +420,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
return -EINVAL;
 
mutex_lock(&xs->mutex);
-   if (xs->dev) {
+   if (xs->state != XSK_READY) {
  

[PATCH bpf v6 0/2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-28 Thread Ilya Maximets
Version 6:

* Better names for socket state.

Version 5:

* Fixed incorrect handling of rtnl_lock.

Version 4:

* 'xdp_umem_clear_dev' exposed to be used while unregistering.
* Added XDP socket state to track if resources are already unbound.
* Split in two fixes.

Version 3:

* Declaration lines ordered from longest to shortest.
* Checking of event type moved to the top to avoid unnecessary
  locking.

Version 2:

* Completely re-implemented using netdev event handler.

Ilya Maximets (2):
  xdp: hold device for umem regardless of zero-copy mode
  xdp: fix hang while unregistering device bound to xdp socket

 include/net/xdp_sock.h |  5 +++
 net/xdp/xdp_umem.c | 21 +-
 net/xdp/xdp_umem.h |  1 +
 net/xdp/xsk.c  | 87 --
 4 files changed, 93 insertions(+), 21 deletions(-)

-- 
2.17.1



[PATCH bpf v6 1/2] xdp: hold device for umem regardless of zero-copy mode

2019-06-28 Thread Ilya Maximets
The device pointer is stored in the umem regardless of zero-copy mode,
so we need to hold the device in all cases.

Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Ilya Maximets 
Acked-by: Jonathan Lemon 
---
 net/xdp/xdp_umem.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 9c6de4f114f8..267b82a4cbcf 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -105,6 +105,9 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 
umem->dev = dev;
umem->queue_id = queue_id;
+
+   dev_hold(dev);
+
if (force_copy)
/* For copy-mode, we are done. */
goto out_rtnl_unlock;
@@ -124,7 +127,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
goto err_unreg_umem;
rtnl_unlock();
 
-   dev_hold(dev);
umem->zc = true;
return 0;
 
@@ -163,10 +165,9 @@ static void xdp_umem_clear_dev(struct xdp_umem *umem)
xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
rtnl_unlock();
 
-   if (umem->zc) {
-   dev_put(umem->dev);
-   umem->zc = false;
-   }
+   dev_put(umem->dev);
+   umem->dev = NULL;
+   umem->zc = false;
 }
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
-- 
2.17.1



Re: [PATCH bpf v5 2/2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-28 Thread Ilya Maximets
On 28.06.2019 1:04, Jonathan Lemon wrote:
> On 27 Jun 2019, at 3:15, Ilya Maximets wrote:
> 
>> A device bound to an XDP socket will not have a zero refcount until
>> the userspace application closes it. This leads to a hang inside
>> 'netdev_wait_allrefs()' if device unregistering is requested:
>>
>>   # ip link del p1
>>   < hang on recvmsg on netlink socket >
>>
>>   # ps -x | grep ip
>>   5126  pts/0    D+   0:00 ip link del p1
>>
>>   # journalctl -b
>>
>>   Jun 05 07:19:16 kernel:
>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>
>>   Jun 05 07:19:27 kernel:
>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>   ...
>>
>> Fix that by implementing NETDEV_UNREGISTER event notification handler
>> to properly clean up all the resources and unref device.
>>
>> This should also allow socket killing via ss(8) utility.
>>
>> Fixes: 965a99098443 ("xsk: add support for bind for Rx")
>> Signed-off-by: Ilya Maximets 
>> ---
>>  include/net/xdp_sock.h |  5 +++
>>  net/xdp/xdp_umem.c | 10 ++---
>>  net/xdp/xdp_umem.h |  1 +
>>  net/xdp/xsk.c  | 87 --
>>  4 files changed, 87 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>> index d074b6d60f8a..82d153a637c7 100644
>> --- a/include/net/xdp_sock.h
>> +++ b/include/net/xdp_sock.h
>> @@ -61,6 +61,11 @@ struct xdp_sock {
>>  struct xsk_queue *tx cacheline_aligned_in_smp;
>>  struct list_head list;
>>  bool zc;
>> +    enum {
>> +    XSK_UNINITIALIZED = 0,
>> +    XSK_BINDED,
>> +    XSK_UNBINDED,
>> +    } state;
> 
> I'd prefer that these were named better, perhaps:
>    XSK_READY,
>    XSK_BOUND,
>    XSK_UNBOUND,

Sure. Thanks for suggestion!

> 
> Other than that:
> Acked-by: Jonathan Lemon 
> 

I'll send a new version with the new state names keeping your ACK.

Best regards, Ilya Maximets.


[PATCH bpf v5 1/2] xdp: hold device for umem regardless of zero-copy mode

2019-06-27 Thread Ilya Maximets
The device pointer is stored in the umem regardless of zero-copy mode,
so we need to hold the device in all cases.

Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Ilya Maximets 
---
 net/xdp/xdp_umem.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 9c6de4f114f8..267b82a4cbcf 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -105,6 +105,9 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 
umem->dev = dev;
umem->queue_id = queue_id;
+
+   dev_hold(dev);
+
if (force_copy)
/* For copy-mode, we are done. */
goto out_rtnl_unlock;
@@ -124,7 +127,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
goto err_unreg_umem;
rtnl_unlock();
 
-   dev_hold(dev);
umem->zc = true;
return 0;
 
@@ -163,10 +165,9 @@ static void xdp_umem_clear_dev(struct xdp_umem *umem)
xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
rtnl_unlock();
 
-   if (umem->zc) {
-   dev_put(umem->dev);
-   umem->zc = false;
-   }
+   dev_put(umem->dev);
+   umem->dev = NULL;
+   umem->zc = false;
 }
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
-- 
2.17.1



[PATCH bpf v5 0/2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-27 Thread Ilya Maximets
Version 5:

* Fixed incorrect handling of rtnl_lock.

Version 4:

* 'xdp_umem_clear_dev' exposed to be used while unregistering.
* Added XDP socket state to track if resources are already unbound.
* Split in two fixes.

Version 3:

* Declaration lines ordered from longest to shortest.
* Checking of event type moved to the top to avoid unnecessary
  locking.

Version 2:

* Completely re-implemented using netdev event handler.

Ilya Maximets (2):
  xdp: hold device for umem regardless of zero-copy mode
  xdp: fix hang while unregistering device bound to xdp socket

 include/net/xdp_sock.h |  5 +++
 net/xdp/xdp_umem.c | 21 +-
 net/xdp/xdp_umem.h |  1 +
 net/xdp/xsk.c  | 87 --
 4 files changed, 93 insertions(+), 21 deletions(-)

-- 
2.17.1



[PATCH bpf v5 2/2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-27 Thread Ilya Maximets
A device bound to an XDP socket will not have a zero refcount until
the userspace application closes it. This leads to a hang inside
'netdev_wait_allrefs()' if device unregistering is requested:

  # ip link del p1
  < hang on recvmsg on netlink socket >

  # ps -x | grep ip
  5126  pts/0D+   0:00 ip link del p1

  # journalctl -b

  Jun 05 07:19:16 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1

  Jun 05 07:19:27 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  ...

Fix that by implementing NETDEV_UNREGISTER event notification handler
to properly clean up all the resources and unref device.

This should also allow socket killing via ss(8) utility.

Fixes: 965a99098443 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets 
---
 include/net/xdp_sock.h |  5 +++
 net/xdp/xdp_umem.c | 10 ++---
 net/xdp/xdp_umem.h |  1 +
 net/xdp/xsk.c  | 87 --
 4 files changed, 87 insertions(+), 16 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d074b6d60f8a..82d153a637c7 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -61,6 +61,11 @@ struct xdp_sock {
struct xsk_queue *tx cacheline_aligned_in_smp;
struct list_head list;
bool zc;
+   enum {
+   XSK_UNINITIALIZED = 0,
+   XSK_BINDED,
+   XSK_UNBINDED,
+   } state;
/* Protects multiple processes in the control path */
struct mutex mutex;
/* Mutual exclusion of NAPI TX thread and sendmsg error paths
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 267b82a4cbcf..20c91f02d3d8 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -140,11 +140,13 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
return err;
 }
 
-static void xdp_umem_clear_dev(struct xdp_umem *umem)
+void xdp_umem_clear_dev(struct xdp_umem *umem)
 {
struct netdev_bpf bpf;
int err;
 
+   ASSERT_RTNL();
+
if (!umem->dev)
return;
 
@@ -153,17 +155,13 @@ static void xdp_umem_clear_dev(struct xdp_umem *umem)
bpf.xsk.umem = NULL;
bpf.xsk.queue_id = umem->queue_id;
 
-   rtnl_lock();
err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf);
-   rtnl_unlock();
 
if (err)
WARN(1, "failed to disable umem!\n");
}
 
-   rtnl_lock();
xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
-   rtnl_unlock();
 
dev_put(umem->dev);
umem->dev = NULL;
@@ -195,7 +193,9 @@ static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
 
 static void xdp_umem_release(struct xdp_umem *umem)
 {
+   rtnl_lock();
xdp_umem_clear_dev(umem);
+   rtnl_unlock();
 
ida_simple_remove(&umem_ida, umem->id);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 27603227601b..a63a9fb251f5 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -10,6 +10,7 @@
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
u16 queue_id, u16 flags);
+void xdp_umem_clear_dev(struct xdp_umem *umem);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..336723948a36 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -335,6 +335,22 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
return 0;
 }
 
+static void xsk_unbind_dev(struct xdp_sock *xs)
+{
+   struct net_device *dev = xs->dev;
+
+   if (!dev || xs->state != XSK_BINDED)
+   return;
+
+   xs->state = XSK_UNBINDED;
+
+   /* Wait for driver to stop using the xdp socket. */
+   xdp_del_sk_umem(xs->umem, xs);
+   xs->dev = NULL;
+   synchronize_net();
+   dev_put(dev);
+}
+
 static int xsk_release(struct socket *sock)
 {
struct sock *sk = sock->sk;
@@ -354,15 +370,7 @@ static int xsk_release(struct socket *sock)
sock_prot_inuse_add(net, sk->sk_prot, -1);
local_bh_enable();
 
-   if (xs->dev) {
-   struct net_device *dev = xs->dev;
-
-   /* Wait for driver to stop using the xdp socket. */
-   xdp_del_sk_umem(xs->umem, xs);
-   xs->dev = NULL;
-   synchronize_net();
-   dev_put(dev);
-   }
+   xsk_unbind_dev(xs);
 
xskq_destroy(xs->rx);
xskq_destroy(xs->tx);
@@ -412,7 +420,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
return -EINVAL;
 
mutex_lock(&xs->mutex);
-   if (xs->dev) {
+   if (xs->state != XSK_UNINITIALIZED) {
 

Re: [PATCH bpf v4 2/2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-27 Thread Ilya Maximets
On 26.06.2019 21:34, Jakub Kicinski wrote:
> On Wed, 26 Jun 2019 21:15:15 +0300, Ilya Maximets wrote:
>> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
>> index 267b82a4cbcf..56729e74cbea 100644
>> --- a/net/xdp/xdp_umem.c
>> +++ b/net/xdp/xdp_umem.c
>> @@ -140,34 +140,38 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
>>  return err;
>>  }
>>  
>> -static void xdp_umem_clear_dev(struct xdp_umem *umem)
>> +void xdp_umem_clear_dev(struct xdp_umem *umem)
>>  {
>> +bool lock = rtnl_is_locked();
> 
> How do you know it's not just locked by someone else?  You need to pass
> the locked state in if this is called from different paths, some of
> which already hold rtnl.

Oh. That's a shame. I need more sleep.

Thanks for spotting. I'll re-work this part.

Best regards, Ilya Maximets.

> 
> Preferably factor the code which needs the lock out into a separate
> function like this:
> 
> void __function()
> {
>   do();
>   the();
>   things();
>   under();
>   the();
>   lock();
> }
> 
> void function()
> {
>   rtnl_lock();
>   __function();
>   rtnl_unlock();
> }
> 
>>  struct netdev_bpf bpf;
>>  int err;
>>  
>> +if (!lock)
>> +rtnl_lock();
> 
> 
> 


[PATCH bpf v4 2/2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-26 Thread Ilya Maximets
A device bound to an XDP socket will not have a zero refcount until
the userspace application closes it. This leads to a hang inside
'netdev_wait_allrefs()' if device unregistering is requested:

  # ip link del p1
  < hang on recvmsg on netlink socket >

  # ps -x | grep ip
  5126  pts/0D+   0:00 ip link del p1

  # journalctl -b

  Jun 05 07:19:16 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1

  Jun 05 07:19:27 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  ...

Fix that by implementing NETDEV_UNREGISTER event notification handler
to properly clean up all the resources and unref device.

This should also allow socket killing via ss(8) utility.

Fixes: 965a99098443 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets 
---
 include/net/xdp_sock.h |  5 +++
 net/xdp/xdp_umem.c | 16 +---
 net/xdp/xdp_umem.h |  1 +
 net/xdp/xsk.c  | 88 --
 4 files changed, 93 insertions(+), 17 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d074b6d60f8a..82d153a637c7 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -61,6 +61,11 @@ struct xdp_sock {
struct xsk_queue *tx cacheline_aligned_in_smp;
struct list_head list;
bool zc;
+   enum {
+   XSK_UNINITIALIZED = 0,
+   XSK_BINDED,
+   XSK_UNBINDED,
+   } state;
/* Protects multiple processes in the control path */
struct mutex mutex;
/* Mutual exclusion of NAPI TX thread and sendmsg error paths
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 267b82a4cbcf..56729e74cbea 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -140,34 +140,38 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
return err;
 }
 
-static void xdp_umem_clear_dev(struct xdp_umem *umem)
+void xdp_umem_clear_dev(struct xdp_umem *umem)
 {
+   bool lock = rtnl_is_locked();
struct netdev_bpf bpf;
int err;
 
+   if (!lock)
+   rtnl_lock();
+
if (!umem->dev)
-   return;
+   goto out_unlock;
 
if (umem->zc) {
bpf.command = XDP_SETUP_XSK_UMEM;
bpf.xsk.umem = NULL;
bpf.xsk.queue_id = umem->queue_id;
 
-   rtnl_lock();
err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf);
-   rtnl_unlock();
 
if (err)
WARN(1, "failed to disable umem!\n");
}
 
-   rtnl_lock();
xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
-   rtnl_unlock();
 
dev_put(umem->dev);
umem->dev = NULL;
umem->zc = false;
+
+out_unlock:
+   if (!lock)
+   rtnl_unlock();
 }
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 27603227601b..a63a9fb251f5 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -10,6 +10,7 @@
 
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
u16 queue_id, u16 flags);
+void xdp_umem_clear_dev(struct xdp_umem *umem);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..883dfd3cdc49 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -335,6 +335,22 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
return 0;
 }
 
+static void xsk_unbind_dev(struct xdp_sock *xs)
+{
+   struct net_device *dev = xs->dev;
+
+   if (!dev || xs->state != XSK_BINDED)
+   return;
+
+   xs->state = XSK_UNBINDED;
+
+   /* Wait for driver to stop using the xdp socket. */
+   xdp_del_sk_umem(xs->umem, xs);
+   xs->dev = NULL;
+   synchronize_net();
+   dev_put(dev);
+}
+
 static int xsk_release(struct socket *sock)
 {
struct sock *sk = sock->sk;
@@ -354,15 +370,7 @@ static int xsk_release(struct socket *sock)
sock_prot_inuse_add(net, sk->sk_prot, -1);
local_bh_enable();
 
-   if (xs->dev) {
-   struct net_device *dev = xs->dev;
-
-   /* Wait for driver to stop using the xdp socket. */
-   xdp_del_sk_umem(xs->umem, xs);
-   xs->dev = NULL;
-   synchronize_net();
-   dev_put(dev);
-   }
+   xsk_unbind_dev(xs);
 
xskq_destroy(xs->rx);
xskq_destroy(xs->tx);
@@ -412,7 +420,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
return -EINVAL;
 
mutex_lock(&xs->mutex);
-   if (xs->dev) {
+   if (xs->state != XSK_UNINITIALIZED) {
err = -

[PATCH bpf v4 0/2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-26 Thread Ilya Maximets
Version 4:

* 'xdp_umem_clear_dev' exposed to be used while unregistering.
* Added XDP socket state to track if resources are already unbound.
* Split in two fixes.

Version 3:

* Declaration lines ordered from longest to shortest.
* Checking of event type moved to the top to avoid unnecessary
  locking.

Version 2:

* Completely re-implemented using netdev event handler.

Ilya Maximets (2):
  xdp: hold device for umem regardless of zero-copy mode
  xdp: fix hang while unregistering device bound to xdp socket

 include/net/xdp_sock.h |  5 +++
 net/xdp/xdp_umem.c | 27 +++--
 net/xdp/xdp_umem.h |  1 +
 net/xdp/xsk.c  | 88 --
 4 files changed, 99 insertions(+), 22 deletions(-)

-- 
2.17.1



[PATCH bpf v4 1/2] xdp: hold device for umem regardless of zero-copy mode

2019-06-26 Thread Ilya Maximets
The device pointer is stored in the umem regardless of zero-copy mode,
so we need to hold the device in all cases.

Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Ilya Maximets 
---
 net/xdp/xdp_umem.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 9c6de4f114f8..267b82a4cbcf 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -105,6 +105,9 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
 
umem->dev = dev;
umem->queue_id = queue_id;
+
+   dev_hold(dev);
+
if (force_copy)
/* For copy-mode, we are done. */
goto out_rtnl_unlock;
@@ -124,7 +127,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
goto err_unreg_umem;
rtnl_unlock();
 
-   dev_hold(dev);
umem->zc = true;
return 0;
 
@@ -163,10 +165,9 @@ static void xdp_umem_clear_dev(struct xdp_umem *umem)
xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
rtnl_unlock();
 
-   if (umem->zc) {
-   dev_put(umem->dev);
-   umem->zc = false;
-   }
+   dev_put(umem->dev);
+   umem->dev = NULL;
+   umem->zc = false;
 }
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
-- 
2.17.1



Re: [PATCH bpf v3] xdp: fix hang while unregistering device bound to xdp socket

2019-06-11 Thread Ilya Maximets
On 11.06.2019 15:13, Björn Töpel wrote:
> On Tue, 11 Jun 2019 at 10:42, Ilya Maximets  wrote:
>>
>> On 11.06.2019 11:09, Björn Töpel wrote:
>>> On Mon, 10 Jun 2019 at 22:49, Jonathan Lemon  
>>> wrote:
>>>>
>>>> On 10 Jun 2019, at 9:15, Ilya Maximets wrote:
>>>>
>>>>> A device bound to an XDP socket will not have a zero refcount until
>>>>> the userspace application closes it. This leads to a hang inside
>>>>> 'netdev_wait_allrefs()' if device unregistering is requested:
>>>>>
>>>>>   # ip link del p1
>>>>>   < hang on recvmsg on netlink socket >
>>>>>
>>>>>   # ps -x | grep ip
>>>>>   5126  pts/0D+   0:00 ip link del p1
>>>>>
>>>>>   # journalctl -b
>>>>>
>>>>>   Jun 05 07:19:16 kernel:
>>>>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>>>>
>>>>>   Jun 05 07:19:27 kernel:
>>>>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>>>>   ...
>>>>>
>>>>> Fix that by implementing NETDEV_UNREGISTER event notification handler
>>>>> to properly clean up all the resources and unref device.
>>>>>
>>>>> This should also allow socket killing via ss(8) utility.
>>>>>
>>>>> Fixes: 965a99098443 ("xsk: add support for bind for Rx")
>>>>> Signed-off-by: Ilya Maximets 
>>>>> ---
>>>>>
>>>>> Version 3:
>>>>>
>>>>> * Declaration lines ordered from longest to shortest.
>>>>> * Checking of event type moved to the top to avoid unnecessary
>>>>>   locking.
>>>>>
>>>>> Version 2:
>>>>>
>>>>> * Completely re-implemented using netdev event handler.
>>>>>
>>>>>  net/xdp/xsk.c | 65
>>>>> ++-
>>>>>  1 file changed, 64 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>>> index a14e8864e4fa..273a419a8c4d 100644
>>>>> --- a/net/xdp/xsk.c
>>>>> +++ b/net/xdp/xsk.c
>>>>> @@ -693,6 +693,57 @@ static int xsk_mmap(struct file *file, struct socket *sock,
>>>>>  size, vma->vm_page_prot);
>>>>>  }
>>>>>
>>>>> +static int xsk_notifier(struct notifier_block *this,
>>>>> + unsigned long msg, void *ptr)
>>>>> +{
>>>>> + struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>>>>> + struct net *net = dev_net(dev);
>>>>> + int i, unregister_count = 0;
>>>>> + struct sock *sk;
>>>>> +
>>>>> + switch (msg) {
>>>>> + case NETDEV_UNREGISTER:
>>>>> + mutex_lock(&net->xdp.lock);
>>>>
>>>> The call is under the rtnl lock, and we're not modifying
>>>> the list, so this mutex shouldn't be needed.
>>>>
>>>
>>> The list can, however, be modified outside the rtnl lock (e.g. at
>>> socket creation). AFAIK the hlist cannot be traversed lock-less,
>>> right?
>>
>> We could use 'rcu_read_lock' instead and iterate with 'sk_for_each_rcu',
>> but we'll not be able to synchronize inside.
>>
>>>
>>>>
>>>>> + sk_for_each(sk, &net->xdp.list) {
>>>>> + struct xdp_sock *xs = xdp_sk(sk);
>>>>> +
>>>>> + mutex_lock(&xs->mutex);
>>>>> + if (dev != xs->dev) {
>>>>> + mutex_unlock(&xs->mutex);
>>>>> + continue;
>>>>> + }
>>>>> +
>>>>> + sk->sk_err = ENETDOWN;
>>>>> + if (!sock_flag(sk, SOCK_DEAD))
>>>>> + sk->sk_error_report(sk);
>>>>> +
>>>>> + /* Wait for driver to stop using the xdp socket. */
>>>>> + xdp_del_sk_umem(xs->umem, xs);
>>>>> + xs->dev = NULL;
>>>>> +

Re: [PATCH bpf v3] xdp: fix hang while unregistering device bound to xdp socket

2019-06-11 Thread Ilya Maximets
On 11.06.2019 11:09, Björn Töpel wrote:
> On Mon, 10 Jun 2019 at 22:49, Jonathan Lemon  wrote:
>>
>> On 10 Jun 2019, at 9:15, Ilya Maximets wrote:
>>
>>> A device bound to an XDP socket will not have a zero refcount until
>>> the userspace application closes it. This leads to a hang inside
>>> 'netdev_wait_allrefs()' if device unregistering is requested:
>>>
>>>   # ip link del p1
>>>   < hang on recvmsg on netlink socket >
>>>
>>>   # ps -x | grep ip
>>>   5126  pts/0D+   0:00 ip link del p1
>>>
>>>   # journalctl -b
>>>
>>>   Jun 05 07:19:16 kernel:
>>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>>
>>>   Jun 05 07:19:27 kernel:
>>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>>   ...
>>>
>>> Fix that by implementing NETDEV_UNREGISTER event notification handler
>>> to properly clean up all the resources and unref device.
>>>
>>> This should also allow socket killing via ss(8) utility.
>>>
>>> Fixes: 965a99098443 ("xsk: add support for bind for Rx")
>>> Signed-off-by: Ilya Maximets 
>>> ---
>>>
>>> Version 3:
>>>
>>> * Declaration lines ordered from longest to shortest.
>>> * Checking of event type moved to the top to avoid unnecessary
>>>   locking.
>>>
>>> Version 2:
>>>
>>> * Completely re-implemented using netdev event handler.
>>>
>>>  net/xdp/xsk.c | 65
>>> ++-
>>>  1 file changed, 64 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>> index a14e8864e4fa..273a419a8c4d 100644
>>> --- a/net/xdp/xsk.c
>>> +++ b/net/xdp/xsk.c
>>> @@ -693,6 +693,57 @@ static int xsk_mmap(struct file *file, struct socket *sock,
>>>  size, vma->vm_page_prot);
>>>  }
>>>
>>> +static int xsk_notifier(struct notifier_block *this,
>>> + unsigned long msg, void *ptr)
>>> +{
>>> + struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>>> + struct net *net = dev_net(dev);
>>> + int i, unregister_count = 0;
>>> + struct sock *sk;
>>> +
>>> + switch (msg) {
>>> + case NETDEV_UNREGISTER:
>>> + mutex_lock(&net->xdp.lock);
>>
>> The call is under the rtnl lock, and we're not modifying
>> the list, so this mutex shouldn't be needed.
>>
> 
> The list can, however, be modified outside the rtnl lock (e.g. at
> socket creation). AFAIK the hlist cannot be traversed lock-less,
> right?

We could use 'rcu_read_lock' instead and iterate with 'sk_for_each_rcu',
but we'll not be able to synchronize inside.
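
For illustration, a sketch of that RCU-based walk (kernel context, not
a standalone program; added here for clarity):

/* Hypothetical RCU-based traversal of the same list.  The walk itself
 * is safe without the mutex, but the body runs in an RCU read-side
 * critical section and therefore must not sleep: */
static void xsk_walk_rcu(struct net *net, struct net_device *dev)
{
	struct sock *sk;

	rcu_read_lock();
	sk_for_each_rcu(sk, &net->xdp.list) {
		struct xdp_sock *xs = xdp_sk(sk);

		if (xs->dev != dev)
			continue;
		/* mutex_lock(&xs->mutex) and synchronize_net() may sleep,
		 * so neither can be called from here. */
	}
	rcu_read_unlock();
}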

> 
>>
>>> + sk_for_each(sk, &net->xdp.list) {
>>> + struct xdp_sock *xs = xdp_sk(sk);
>>> +
>>> + mutex_lock(&xs->mutex);
>>> + if (dev != xs->dev) {
>>> + mutex_unlock(&xs->mutex);
>>> + continue;
>>> + }
>>> +
>>> + sk->sk_err = ENETDOWN;
>>> + if (!sock_flag(sk, SOCK_DEAD))
>>> + sk->sk_error_report(sk);
>>> +
>>> + /* Wait for driver to stop using the xdp socket. */
>>> + xdp_del_sk_umem(xs->umem, xs);
>>> + xs->dev = NULL;
>>> + synchronize_net();
>> Isn't this handled by the unregister_count case below?
>>
> 
> To clarify, setting dev to NULL and xdp_del_sk_umem() + sync makes
> sure that a driver doesn't touch the Tx and Rx rings. Nothing can be
> assumed about completion + fill ring (umem), until zero-copy has been
> disabled via ndo_bpf.
> 
>>> +
>>> + /* Clear device references in umem. */
>>> + xdp_put_umem(xs->umem);
>>> + xs->umem = NULL;
>>
>> This makes me uneasy.  We need to unregister the umem from
>> the device (xdp_umem_clear_dev()) but this can remove the umem
>> pages out from underneath the xsk.
>>
> 
> Yes, this is scary. The socket is alive, and userland typically has
> the fill/completion rings mmapped. Then the umem r

Re: [PATCH bpf v2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-10 Thread Ilya Maximets
On 10.06.2019 11:05, Ilya Maximets wrote:
> On 08.06.2019 2:31, Jakub Kicinski wrote:
>> On Fri,  7 Jun 2019 20:31:43 +0300, Ilya Maximets wrote:
>>> +static int xsk_notifier(struct notifier_block *this,
>>> +   unsigned long msg, void *ptr)
>>> +{
>>> +   struct sock *sk;
>>> +   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>>> +   struct net *net = dev_net(dev);
>>> +   int i, unregister_count = 0;
>>
>> Please order the var declaration lines longest to shortest.
>> (reverse christmas tree)
> 
> Hi.
> I'm not a fan of mixing 'struct's with bare types in the declarations.
> Moving the 'sk' to the third place will make a hole like this:
> 
>   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>   struct net *net = dev_net(dev);
>   struct sock *sk;
>   int i, unregister_count = 0;
> 
> Which is not looking good.
> Moving to the 4th place:
> 
>   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>   struct net *net = dev_net(dev);
>   int i, unregister_count = 0;
>   struct sock *sk;

I've sent v3 with this variant and with moved msg check to the top level.

> 
> This variant doesn't look good for me because of mixing 'struct's with
> bare integers.
> 
> Do you think I need to use one of above variants?
> 
>>
>>> +mutex_lock(&net->xdp.lock);
>>> +sk_for_each(sk, &net->xdp.list) {
>>> +struct xdp_sock *xs = xdp_sk(sk);
>>> +
>>> +mutex_lock(&xs->mutex);
>>> +   switch (msg) {
>>> +   case NETDEV_UNREGISTER:
>>
>> You should probably check the msg type earlier and not take all the
>> locks and iterate for other types..
> 
> Yeah. I thought about it too. Will fix in the next version.
> 
> Best regards, Ilya Maximets.
> 


[PATCH bpf v3] xdp: fix hang while unregistering device bound to xdp socket

2019-06-10 Thread Ilya Maximets
A device bound to an XDP socket will not have a zero refcount until
the userspace application closes it. This leads to a hang inside
'netdev_wait_allrefs()' if device unregistering is requested:

  # ip link del p1
  < hang on recvmsg on netlink socket >

  # ps -x | grep ip
  5126  pts/0D+   0:00 ip link del p1

  # journalctl -b

  Jun 05 07:19:16 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1

  Jun 05 07:19:27 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  ...

Fix that by implementing NETDEV_UNREGISTER event notification handler
to properly clean up all the resources and unref device.

This should also allow socket killing via ss(8) utility.

Fixes: 965a99098443 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets 
---

Version 3:

* Declaration lines ordered from longest to shortest.
* Checking of event type moved to the top to avoid unnecessary
  locking.

Version 2:

* Completely re-implemented using netdev event handler.

 net/xdp/xsk.c | 65 ++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..273a419a8c4d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -693,6 +693,57 @@ static int xsk_mmap(struct file *file, struct socket *sock,
   size, vma->vm_page_prot);
 }
 
+static int xsk_notifier(struct notifier_block *this,
+   unsigned long msg, void *ptr)
+{
+   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+   struct net *net = dev_net(dev);
+   int i, unregister_count = 0;
+   struct sock *sk;
+
+   switch (msg) {
+   case NETDEV_UNREGISTER:
+   mutex_lock(&net->xdp.lock);
+   sk_for_each(sk, &net->xdp.list) {
+   struct xdp_sock *xs = xdp_sk(sk);
+
+   mutex_lock(&xs->mutex);
+   if (dev != xs->dev) {
+   mutex_unlock(&xs->mutex);
+   continue;
+   }
+
+   sk->sk_err = ENETDOWN;
+   if (!sock_flag(sk, SOCK_DEAD))
+   sk->sk_error_report(sk);
+
+   /* Wait for driver to stop using the xdp socket. */
+   xdp_del_sk_umem(xs->umem, xs);
+   xs->dev = NULL;
+   synchronize_net();
+
+   /* Clear device references in umem. */
+   xdp_put_umem(xs->umem);
+   xs->umem = NULL;
+
+   mutex_unlock(&xs->mutex);
+   unregister_count++;
+   }
+   mutex_unlock(&net->xdp.lock);
+
+   if (unregister_count) {
+   /* Wait for umem clearing completion. */
+   synchronize_net();
+   for (i = 0; i < unregister_count; i++)
+   dev_put(dev);
+   }
+
+   break;
+   }
+
+   return NOTIFY_DONE;
+}
+
 static struct proto xsk_proto = {
.name = "XDP",
.owner =THIS_MODULE,
@@ -727,7 +778,8 @@ static void xsk_destruct(struct sock *sk)
if (!sock_flag(sk, SOCK_DEAD))
return;
 
-   xdp_put_umem(xs->umem);
+   if (xs->umem)
+   xdp_put_umem(xs->umem);
 
sk_refcnt_debug_dec(sk);
 }
@@ -784,6 +836,10 @@ static const struct net_proto_family xsk_family_ops = {
.owner  = THIS_MODULE,
 };
 
+static struct notifier_block xsk_netdev_notifier = {
+   .notifier_call  = xsk_notifier,
+};
+
 static int __net_init xsk_net_init(struct net *net)
 {
mutex_init(>xdp.lock);
@@ -816,8 +872,15 @@ static int __init xsk_init(void)
err = register_pernet_subsys(&xsk_net_ops);
if (err)
goto out_sk;
+
err = register_netdevice_notifier(&xsk_netdev_notifier);
+   if (err)
+   goto out_pernet;
+
return 0;
 
+out_pernet:
unregister_pernet_subsys(&xsk_net_ops);
 out_sk:
sock_unregister(PF_XDP);
 out_proto:
-- 
2.17.1



Re: [PATCH bpf v2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-10 Thread Ilya Maximets
On 08.06.2019 2:31, Jakub Kicinski wrote:
> On Fri,  7 Jun 2019 20:31:43 +0300, Ilya Maximets wrote:
>> +static int xsk_notifier(struct notifier_block *this,
>> +unsigned long msg, void *ptr)
>> +{
>> +struct sock *sk;
>> +struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>> +struct net *net = dev_net(dev);
>> +int i, unregister_count = 0;
> 
> Please order the var declaration lines longest to shortest.
> (reverse christmas tree)

Hi.
I'm not a fan of mixing 'struct's with bare types in the declarations.
Moving the 'sk' to the third place will make a hole like this:

struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct net *net = dev_net(dev);
struct sock *sk;
int i, unregister_count = 0;

Which is not looking good.
Moving to the 4th place:

struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct net *net = dev_net(dev);
int i, unregister_count = 0;
struct sock *sk;

This variant doesn't look good for me because of mixing 'struct's with
bare integers.

Do you think I need to use one of above variants?

> 
>> +mutex_lock(&net->xdp.lock);
>> +sk_for_each(sk, &net->xdp.list) {
>> +struct xdp_sock *xs = xdp_sk(sk);
>> +
>> +mutex_lock(&xs->mutex);
>> +switch (msg) {
>> +case NETDEV_UNREGISTER:
> 
> You should probably check the msg type earlier and not take all the
> locks and iterate for other types..

Yeah. I thought about it too. Will fix in the next version.

Best regards, Ilya Maximets.


[PATCH bpf v2] xdp: fix hang while unregistering device bound to xdp socket

2019-06-07 Thread Ilya Maximets
A device bound to an XDP socket will not have a zero refcount until
the userspace application closes it. This leads to a hang inside
'netdev_wait_allrefs()' if device unregistering is requested:

  # ip link del p1
  < hang on recvmsg on netlink socket >

  # ps -x | grep ip
  5126  pts/0D+   0:00 ip link del p1

  # journalctl -b

  Jun 05 07:19:16 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1

  Jun 05 07:19:27 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  ...

Fix that by implementing NETDEV_UNREGISTER event notification handler
to properly clean up all the resources and unref device.

This should also allow socket killing via ss(8) utility.

Fixes: 965a99098443 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets 
---
 net/xdp/xsk.c | 62 ++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..3f3979579d21 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -693,6 +693,54 @@ static int xsk_mmap(struct file *file, struct socket *sock,
   size, vma->vm_page_prot);
 }
 
+static int xsk_notifier(struct notifier_block *this,
+   unsigned long msg, void *ptr)
+{
+   struct sock *sk;
+   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+   struct net *net = dev_net(dev);
+   int i, unregister_count = 0;
+
+   mutex_lock(&net->xdp.lock);
+   sk_for_each(sk, &net->xdp.list) {
+   struct xdp_sock *xs = xdp_sk(sk);
+
+   mutex_lock(&xs->mutex);
+   switch (msg) {
+   case NETDEV_UNREGISTER:
+   if (dev != xs->dev)
+   break;
+
+   sk->sk_err = ENETDOWN;
+   if (!sock_flag(sk, SOCK_DEAD))
+   sk->sk_error_report(sk);
+
+   /* Wait for driver to stop using the xdp socket. */
+   xdp_del_sk_umem(xs->umem, xs);
+   xs->dev = NULL;
+   synchronize_net();
+
+   /* Clear device references in umem. */
+   xdp_put_umem(xs->umem);
+   xs->umem = NULL;
+
+   unregister_count++;
+   break;
+   }
+   mutex_unlock(&xs->mutex);
+   }
+   mutex_unlock(&net->xdp.lock);
+
+   if (unregister_count) {
+   /* Wait for umem clearing completion. */
+   synchronize_net();
+   for (i = 0; i < unregister_count; i++)
+   dev_put(dev);
+   }
+
+   return NOTIFY_DONE;
+}
+
 static struct proto xsk_proto = {
.name = "XDP",
.owner =THIS_MODULE,
@@ -727,7 +775,8 @@ static void xsk_destruct(struct sock *sk)
if (!sock_flag(sk, SOCK_DEAD))
return;
 
-   xdp_put_umem(xs->umem);
+   if (xs->umem)
+   xdp_put_umem(xs->umem);
 
sk_refcnt_debug_dec(sk);
 }
@@ -784,6 +833,10 @@ static const struct net_proto_family xsk_family_ops = {
.owner  = THIS_MODULE,
 };
 
+static struct notifier_block xsk_netdev_notifier = {
+   .notifier_call  = xsk_notifier,
+};
+
 static int __net_init xsk_net_init(struct net *net)
 {
mutex_init(>xdp.lock);
@@ -816,8 +869,15 @@ static int __init xsk_init(void)
err = register_pernet_subsys(&xsk_net_ops);
if (err)
goto out_sk;
+
err = register_netdevice_notifier(&xsk_netdev_notifier);
+   if (err)
+   goto out_pernet;
+
return 0;
 
+out_pernet:
unregister_pernet_subsys(&xsk_net_ops);
 out_sk:
sock_unregister(PF_XDP);
 out_proto:
-- 
2.17.1



[PATCH bpf] xdp: check device pointer before clearing

2019-06-07 Thread Ilya Maximets
We should not call 'ndo_bpf()' or 'dev_put()' with a NULL argument.

Fixes: c9b47cc1fabc ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Ilya Maximets 
---

I'm not sure if this fixes any real NULL pointer dereference, but the
code is inconsistent anyway and should be fixed.

 net/xdp/xdp_umem.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 2b18223e7eb8..9c6de4f114f8 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -143,6 +143,9 @@ static void xdp_umem_clear_dev(struct xdp_umem *umem)
struct netdev_bpf bpf;
int err;
 
+   if (!umem->dev)
+   return;
+
if (umem->zc) {
bpf.command = XDP_SETUP_XSK_UMEM;
bpf.xsk.umem = NULL;
@@ -156,11 +159,9 @@ static void xdp_umem_clear_dev(struct xdp_umem *umem)
WARN(1, "failed to disable umem!\n");
}
 
-   if (umem->dev) {
-   rtnl_lock();
-   xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
-   rtnl_unlock();
-   }
+   rtnl_lock();
+   xdp_clear_umem_at_qid(umem->dev, umem->queue_id);
+   rtnl_unlock();
 
if (umem->zc) {
dev_put(umem->dev);
-- 
2.17.1



Re: [PATCH] net: Fix hang while unregistering device bound to xdp socket

2019-06-07 Thread Ilya Maximets
On 06.06.2019 21:03, Jonathan Lemon wrote:
> On 6 Jun 2019, at 5:40, Ilya Maximets wrote:
> 
>> A device bound to an XDP socket will not have a zero refcount until
>> the userspace application closes it. This leads to a hang inside
>> 'netdev_wait_allrefs()' if device unregistering is requested:
>>
>>   # ip link del p1
>>   < hang on recvmsg on netlink socket >
>>
>>   # ps -x | grep ip
>>   5126  pts/0D+   0:00 ip link del p1
>>
>>   # journalctl -b
>>
>>   Jun 05 07:19:16 kernel:
>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>
>>   Jun 05 07:19:27 kernel:
>>   unregister_netdevice: waiting for p1 to become free. Usage count = 1
>>   ...
>>
>> Fix that by counting XDP references for the device and failing
>> RTM_DELLINK with EBUSY if device is still in use by any XDP socket.
>>
>> With this change:
>>
>>   # ip link del p1
>>   RTNETLINK answers: Device or resource busy
>>
>> Fixes: 965a99098443 ("xsk: add support for bind for Rx")
>> Signed-off-by: Ilya Maximets 
>> ---
>>
>> Another option could be to force closing all the corresponding AF_XDP
>> sockets, but I didn't figure out how to do this properly yet.
>>
>>  include/linux/netdevice.h | 25 +
>>  net/core/dev.c| 10 ++
>>  net/core/rtnetlink.c  |  6 ++
>>  net/xdp/xsk.c |  7 ++-
>>  4 files changed, 47 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 44b47e9df94a..24451cfc5590 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1705,6 +1705,7 @@ enum netdev_priv_flags {
>>   *  @watchdog_timer:List of timers
>>   *
>>   *  @pcpu_refcnt:   Number of references to this device
>> + *  @pcpu_xdp_refcnt:   Number of XDP socket references to this device
>>   *  @todo_list: Delayed register/unregister
>>   *  @link_watch_list:   XXX: need comments on this one
>>   *
>> @@ -1966,6 +1967,7 @@ struct net_device {
>>  struct timer_list   watchdog_timer;
>>
>>  int __percpu*pcpu_refcnt;
>> +int __percpu*pcpu_xdp_refcnt;
>>  struct list_headtodo_list;
> 
> 
> I understand the intention here, but don't think that putting an XDP reference
> into the generic netdev structure is the right way of doing this.  Likely the
> NETDEV_UNREGISTER notifier should be used so the socket and umem unbind from
> the device.
> 

Thanks for the pointer! That is exactly what I was looking for.
I'll make a new version that unbinds the resources using a netdevice notifier.

Best regards, Ilya Maximets.


[PATCH] net: Fix hang while unregistering device bound to xdp socket

2019-06-06 Thread Ilya Maximets
A device bound to an XDP socket will not have a zero refcount until
the userspace application closes it. This leads to a hang inside
'netdev_wait_allrefs()' if device unregistering is requested:

  # ip link del p1
  < hang on recvmsg on netlink socket >

  # ps -x | grep ip
  5126  pts/0D+   0:00 ip link del p1

  # journalctl -b

  Jun 05 07:19:16 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1

  Jun 05 07:19:27 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  ...

Fix that by counting XDP references for the device and failing
RTM_DELLINK with EBUSY if the device is still in use by any XDP socket.

With this change:

  # ip link del p1
  RTNETLINK answers: Device or resource busy

Fixes: 965a99098443 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets 
---

Another option could be to force closing all the corresponding AF_XDP
sockets, but I haven't figured out how to do this properly yet.

 include/linux/netdevice.h | 25 +
 net/core/dev.c| 10 ++
 net/core/rtnetlink.c  |  6 ++
 net/xdp/xsk.c |  7 ++-
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 44b47e9df94a..24451cfc5590 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1705,6 +1705,7 @@ enum netdev_priv_flags {
  * @watchdog_timer:List of timers
  *
  * @pcpu_refcnt:   Number of references to this device
+ * @pcpu_xdp_refcnt:   Number of XDP socket references to this device
  * @todo_list: Delayed register/unregister
  * @link_watch_list:   XXX: need comments on this one
  *
@@ -1966,6 +1967,7 @@ struct net_device {
struct timer_list   watchdog_timer;
 
int __percpu*pcpu_refcnt;
+   int __percpu*pcpu_xdp_refcnt;
struct list_headtodo_list;
 
struct list_headlink_watch_list;
@@ -2636,6 +2638,7 @@ static inline void unregister_netdevice(struct net_device *dev)
 }
 
 int netdev_refcnt_read(const struct net_device *dev);
+int netdev_xdp_refcnt_read(const struct net_device *dev);
 void free_netdev(struct net_device *dev);
 void netdev_freemem(struct net_device *dev);
 void synchronize_net(void);
@@ -3739,6 +3742,28 @@ static inline void dev_hold(struct net_device *dev)
this_cpu_inc(*dev->pcpu_refcnt);
 }
 
+/**
+ * dev_put_xdp - release xdp reference to device
+ * @dev: network device
+ *
+ * Decrease the reference counter of XDP sockets bound to device.
+ */
+static inline void dev_put_xdp(struct net_device *dev)
+{
+   this_cpu_dec(*dev->pcpu_xdp_refcnt);
+}
+
+/**
+ * dev_hold_xdp - get xdp reference to device
+ * @dev: network device
+ *
+ * Increase the reference counter of XDP sockets bound to device.
+ */
+static inline void dev_hold_xdp(struct net_device *dev)
+{
+   this_cpu_inc(*dev->pcpu_xdp_refcnt);
+}
+
 /* Carrier loss detection, dial on demand. The functions netif_carrier_on
  * and _off may be called from IRQ context, but it is caller
  * who is responsible for serialization of these calls.
diff --git a/net/core/dev.c b/net/core/dev.c
index 66f7508825bd..f6f7cf3d8e93 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8840,6 +8840,16 @@ int netdev_refcnt_read(const struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_refcnt_read);
 
+int netdev_xdp_refcnt_read(const struct net_device *dev)
+{
+   int i, refcnt = 0;
+
+   for_each_possible_cpu(i)
+   refcnt += *per_cpu_ptr(dev->pcpu_xdp_refcnt, i);
+   return refcnt;
+}
+EXPORT_SYMBOL(netdev_xdp_refcnt_read);
+
 /**
  * netdev_wait_allrefs - wait until all references are gone.
  * @dev: target net_device
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index adcc045952c2..f88bf52d41b3 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2777,6 +2777,9 @@ static int rtnl_group_dellink(const struct net *net, int group)
ops = dev->rtnl_link_ops;
if (!ops || !ops->dellink)
return -EOPNOTSUPP;
+
+   if (netdev_xdp_refcnt_read(dev))
+   return -EBUSY;
}
}
 
@@ -2805,6 +2808,9 @@ int rtnl_delete_link(struct net_device *dev)
if (!ops || !ops->dellink)
return -EOPNOTSUPP;
 
+   if (netdev_xdp_refcnt_read(dev))
+   return -EBUSY;
+
ops->dellink(dev, &list_kill);
unregister_netdevice_many(&list_kill);
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..215cc8712b8d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -361,6 +361,7 @@ static int xsk_release(struct socket *sock)
xdp_del_sk_umem(xs->umem, xs);
xs->dev = NULL;
synchronize_net();
+