Re: [PATCH v2 net] mlx4_core: restore optimal ICM memory allocation

2018-05-31 Thread Qing Huang




On 5/31/2018 5:52 AM, Eric Dumazet wrote:

Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
brought two regressions caught in our regression suite.

The big one is an additional cost of 256 bytes of overhead per 4096 bytes,
or 6.25%, which is unacceptable since ICM can be pretty large.

This comes from having to allocate one struct mlx4_icm_chunk (256 bytes)
per MLX4_TABLE_CHUNK, which the buggy commit shrank to 4KB
(instead of the prior 256KB).
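
[To put that in perspective with the log_num_mtt=30 example discussed later in this thread; illustrative arithmetic, not from the original mail: 2^30 mtt entries at 8 bytes each is 8GB of ICM. At 4KB per table chunk that is 2M chunks, each paying the ~256-byte struct mlx4_icm_chunk cost, i.e. roughly 512MB of bookkeeping overhead; with 256KB chunks the same table needs only 32K chunks, about 8MB.]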


It would be great if you could share the test case that triggered the
KASAN report in your environment. Our QA has been running intensive tests
using 8KB or 4KB chunk size configurations for some time, and no one has
reported memory corruption issues so far, though we are not running the
latest upstream kernel.

IMO, it's worthwhile to find out the root cause and fix the problem.




Note that mlx4_alloc_icm() is already able to try high order allocations
and fall back to low-order allocations under high memory pressure.

Most of these allocations happen right after boot time, when we have
plenty of non-fragmented memory; there is really no point in being so
pessimistic and breaking huge pages into order-0 ones just for fun.

We only have to tweak gfp_mask a bit, to help fall back faster,
without risking OOM killings.
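
[For context, the fallback Eric refers to is the allocation loop in mlx4_alloc_icm(); a simplified sketch of that loop as it looks with the v2 patch applied, not the verbatim driver code:

	/* Try the largest order that still fits, halve it on failure.
	 * With __GFP_DIRECT_RECLAIM cleared for orders > 0, a fragmented
	 * zone makes the retry happen immediately instead of first
	 * stalling in compaction/reclaim.
	 */
	while (npages > 0) {
		while (1 << cur_order > npages)
			--cur_order;

		mask = gfp_mask;
		if (cur_order)
			mask &= ~__GFP_DIRECT_RECLAIM;

		ret = mlx4_alloc_icm_pages(&chunk->mem[chunk->npages],
					   cur_order, mask, dev->numa_node);
		if (ret) {
			if (--cur_order < 0)
				goto fail;	/* even order-0 failed */
			continue;		/* retry with a smaller order */
		}
		npages -= 1 << cur_order;
	}
]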


Just FYI, out of memory wasn't our original concern. We didn't encounter OOM killings.





Second regression is an KASAN fault, that will need further investigations.

Fixes: 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
Signed-off-by: Eric Dumazet 
Acked-by: Tariq Toukan 
Cc: John Sperbeck 
Cc: Tarick Bedeir 
Cc: Qing Huang 
Cc: Daniel Jurgens 
Cc: Zhu Yanjun 
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 685337d58276fc91baeeb64387c52985e1bc6dda..5342bd8a3d0bfaa9e76bb9b6943790606c97b181 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,13 @@
 #include "fw.h"

 /*
- * We allocate in page size (default 4KB on many archs) chunks to avoid high
- * order memory allocations in fragmented/high usage memory situation.
+ * We allocate in as big chunks as we can, up to a maximum of 256 KB
+ * per chunk. Note that the chunks are not necessarily in contiguous
+ * physical memory.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
-	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
+	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
+	MLX4_TABLE_CHUNK_SIZE	= 1 << 18,
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -135,6 +136,7 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
 	struct mlx4_icm *icm;
 	struct mlx4_icm_chunk *chunk = NULL;
 	int cur_order;
+	gfp_t mask;
 	int ret;

 	/* We use sg_set_buf for coherent allocs, which assumes low memory */
@@ -178,13 +180,17 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
 		while (1 << cur_order > npages)
 			--cur_order;

+		mask = gfp_mask;
+		if (cur_order)
+			mask &= ~__GFP_DIRECT_RECLAIM;
+
 		if (coherent)
 			ret = mlx4_alloc_icm_coherent(&dev->persist->pdev->dev,
						      &chunk->mem[chunk->npages],
-						      cur_order, gfp_mask);
+						      cur_order, mask);
 		else
 			ret = mlx4_alloc_icm_pages(&chunk->mem[chunk->npages],
-						   cur_order, gfp_mask,
+						   cur_order, mask,
						   dev->numa_node);

 		if (ret) {




Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation

2018-05-30 Thread Qing Huang




On 5/30/2018 2:30 PM, Eric Dumazet wrote:

On Wed, May 30, 2018 at 5:08 PM Qing Huang  wrote:


On 5/30/2018 1:50 PM, Eric Dumazet wrote:

On Wed, May 30, 2018 at 4:30 PM Qing Huang  wrote:

On 5/29/2018 9:11 PM, Eric Dumazet wrote:

Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
brought a regression caught in our regression suite, thanks to KASAN.

If the KASAN-reported issue was really caused by smaller chunk sizes,
changing the allocation order dynamically will eventually hit the same issue.

Sigh, you have little idea of what your patch really did...

The KASAN part only shows the tip of the iceberg, but our main concern
is an increase of memory overhead.

Well, the commit log only mentioned KASAN, but the change here didn't
seem to solve the issue.

Can you elaborate ?

My patch solves our problems.

Both the memory overhead and KASAN splats are gone.


If the KASAN issue was triggered by using smaller chunks when under memory
pressure with lots of fragments, low-order memory allocation will do
similar things. So perhaps in your test env, memory allocation and usage
is relatively static; that's probably why using larger chunks didn't
really exercise the low-order allocation code path, hence no KASAN issue
was spotted.

A smaller chunk size in the mlx4 driver is not supposed to cause any
memory corruption. We will probably need to continue to investigate this.
Can you provide the test command that triggers this issue when running a
KASAN kernel, so we can try to reproduce it in our lab? It could be that
upstream code is missing some other fixes.



The alternative is to revert your patch, since we are now very late in the 4.17 cycle.

Memory usage has grown a lot with your patch, since each 4KB page needs a full
struct mlx4_icm_chunk (256 bytes of overhead!)

Going to smaller chunks will have some overhead. It depends on the
application though. What's the total increased memory consumption in your env?

As I explained, your patch adds 256 bytes of overhead per 4KB.

Your changelog did not mention that at all, and we discovered this
the hard way.


If you had some concern regarding memory usage, you should have brought it
up during code review.

Repeated failure and retry for lower-order allocations could be bad for
latency too. That wasn't mentioned in this commit either.

Like I said, how much overhead there is really depends on the application.
256 bytes times the number of chunks may not be significant on a server
with lots of memory.


That is pretty intolerable, and is a blocker for us; memory is precious.







Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation

2018-05-30 Thread Qing Huang




On 5/30/2018 1:50 PM, Eric Dumazet wrote:

On Wed, May 30, 2018 at 4:30 PM Qing Huang  wrote:



On 5/29/2018 9:11 PM, Eric Dumazet wrote:

Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
brought a regression caught in our regression suite, thanks to KASAN.

If the KASAN-reported issue was really caused by smaller chunk sizes,
changing the allocation order dynamically will eventually hit the same issue.

Sigh, you have little idea of what your patch really did...

The KASAN part only shows the tip of the iceberg, but our main concern
is an increase of memory overhead.


Well, the commit log only mentioned KASAN, but the change here didn't
seem to solve the issue.



The alternative is to revert your patch, since we are now very late in the 4.17 cycle.

Memory usage has grown a lot with your patch, since each 4KB page needs a full
struct mlx4_icm_chunk (256 bytes of overhead !)


Going to smaller chunks will have some overhead. It depends on the
application though. What's the total increased memory consumption in your env?



Really we have no choice here, your patch went too far and increased
memory consumption quite a lot.






My patch is simply the best way to address your original concern without
increasing overall overhead.

(Each struct mlx4_icm_chunk should be able to store MLX4_ICM_CHUNK_LEN
pages, instead of one 4KB page.)
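
[For readers following along: the 256-byte figure comes from the chunk descriptor in drivers/net/ethernet/mellanox/mlx4/icm.h, which is deliberately sized so the whole struct occupies about 256 bytes; the scatterlist array is what should amortize that cost over many pages. Quoted from memory, so treat as a sketch rather than the verbatim header:

	enum {
		/* chosen so sizeof(struct mlx4_icm_chunk) is ~256 bytes */
		MLX4_ICM_CHUNK_LEN =
			((256 - sizeof(struct list_head) - 2 * sizeof(int)) /
			 (sizeof(struct scatterlist))),
	};

	struct mlx4_icm_chunk {
		struct list_head	list;
		int			npages;
		int			nsg;
		struct scatterlist	mem[MLX4_ICM_CHUNK_LEN];
	};
]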




Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation

2018-05-30 Thread Qing Huang




On 5/29/2018 9:11 PM, Eric Dumazet wrote:

Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
brought a regression caught in our regression suite, thanks to KASAN.


If the KASAN-reported issue was really caused by smaller chunk sizes,
changing the allocation order dynamically will eventually hit the same issue.



Note that mlx4_alloc_icm() is already able to try high order allocations
and fall back to low-order allocations under high memory pressure.

We only have to tweak gfp_mask a bit, to help fall back faster,
without risking OOM killings.

BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
Read of size 4 at addr 8817df584f68 by task qp_listing_test/92585

CPU: 38 PID: 92585 Comm: qp_listing_test Tainted: G   O
Call Trace:
  [] dump_stack+0x4d/0x72
  [] print_address_description+0x6f/0x260
  [] kasan_report+0x257/0x370
  [] __asan_report_load4_noabort+0x19/0x20
  [] to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
  [] mlx4_ib_query_qp+0x1213/0x1660 [mlx4_ib]
  [] qpstat_print_qp+0x13b/0x500 [ib_uverbs]
  [] qpstat_seq_show+0x4a/0xb0 [ib_uverbs]
  [] seq_read+0xa9c/0x1230
  [] proc_reg_read+0xc1/0x180
  [] __vfs_read+0xe8/0x730
  [] vfs_read+0xf7/0x300
  [] SyS_read+0xd2/0x1b0
  [] do_syscall_64+0x186/0x420
  [] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f851a7bb30d
RSP: 002b:7ffd09a758c0 EFLAGS: 0293 ORIG_RAX: 
RAX: ffda RBX: 7f84ff959440 RCX: 7f851a7bb30d
RDX: 0003fc00 RSI: 7f84ff60a000 RDI: 000b
RBP: 7ffd09a75900 R08:  R09: 
R10: 0022 R11: 0293 R12: 
R13: 0003 R14: 0003 R15: 7f84ff60a000

Allocated by task 4488:
  save_stack+0x46/0xd0
  kasan_kmalloc+0xad/0xe0
  __kmalloc+0x101/0x5e0
  ib_register_device+0xc03/0x1250 [ib_core]
  mlx4_ib_add+0x27d6/0x4dd0 [mlx4_ib]
  mlx4_add_device+0xa9/0x340 [mlx4_core]
  mlx4_register_interface+0x16e/0x390 [mlx4_core]
  xhci_pci_remove+0x7a/0x180 [xhci_pci]
  do_one_initcall+0xa0/0x230
  do_init_module+0x1b9/0x5a4
  load_module+0x63e6/0x94c0
  SYSC_init_module+0x1a4/0x1c0
  SyS_init_module+0xe/0x10
  do_syscall_64+0x186/0x420
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Freed by task 0:
(stack is not available)

The buggy address belongs to the object at 8817df584f40
  which belongs to the cache kmalloc-32 of size 32
The buggy address is located 8 bytes to the right of
  32-byte region [8817df584f40, 8817df584f60)
The buggy address belongs to the page:
page:ea005f7d6100 count:1 mapcount:0 mapping:8817df584000 index:0x8817df584fc1
flags: 0x8800100(slab)
raw: 08800100 8817df584000 8817df584fc1 0001003f
raw: ea005f3ac0a0 ea005c476760 8817fec00900 883ff78d26c0
page dumped because: kasan: bad access detected
page->mem_cgroup:883ff78d26c0

Memory state around the buggy address:
  8817df584e00: 00 03 fc fc fc fc fc fc 00 03 fc fc fc fc fc fc
  8817df584e80: 00 00 00 04 fc fc fc fc 00 00 00 fc fc fc fc fc
> 8817df584f00: fb fb fb fb fc fc fc fc 00 00 00 00 fc fc fc fc
                                        ^
  8817df584f80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
  8817df585000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
Signed-off-by: Eric Dumazet 
Cc: John Sperbeck 
Cc: Tarick Bedeir 
Cc: Qing Huang 
Cc: Daniel Jurgens 
Cc: Zhu Yanjun 
Cc: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 685337d58276fc91baeeb64387c52985e1bc6dda..cae33d5c7dbd9ba7929adcf2127b104f6796fa5a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,13 @@
 #include "fw.h"

 /*
- * We allocate in page size (default 4KB on many archs) chunks to avoid high
- * order memory allocations in fragmented/high usage memory situation.
+ * We allocate in as big chunks as we can, up to a maximum of 256 KB
+ * per chunk. Note that the chunks are not necessarily in contiguous
+ * physical memory.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
-	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
+	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
+	MLX4_TABLE_CHUNK_SIZE	= 1 << 18,
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -135,6 +136,7 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
 	struct mlx4_icm *icm;
 	struct mlx4_icm_chunk *chunk = NULL;
 	int cur_order;
+	gfp_t mask;
 	int ret;

 	/* We use sg_set_buf for coherent allocs, which assumes low memory */

@@ -178,13 +180

[PATCH V4] mlx4_core: allocate ICM memory in page size chunks

2018-05-23 Thread Qing Huang
When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.
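
[Checking the arithmetic: 2^30 entries / 512 entries per 4KB chunk = 2^21 = 2M chunks, and 2M pointers * 8 bytes per pointer = 16MB for the table->icm array.]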

The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
Reviewed-by: Zhu Yanjun 
---
v4: use kvzalloc instead of vzalloc
add one err condition check
don't include vmalloc.h any more

v3: use PAGE_SIZE instead of PAGE_SHIFT
add comma to the end of enum variables
include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..685337d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	u64 size;

 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
+	if (WARN_ON(!obj_per_chunk))
+		return -EINVAL;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm	= kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm	= kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 		mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}

-	kfree(table->icm);
+	kvfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 		mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}

-	kfree(table->icm);
+	kvfree(table->icm);
 }
-- 
2.9.3



Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-22 Thread Qing Huang



On 5/22/2018 8:33 AM, Tariq Toukan wrote:



On 18/05/2018 12:45 AM, Qing Huang wrote:



On 5/17/2018 2:14 PM, Eric Dumazet wrote:

On 05/17/2018 01:53 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...


NACK on this patch.

You have been asked repeatedly to use kvmalloc()

This is not a minor suggestion.

Take a look at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d



Would you please take a look at how table->icm is being used in the mlx4
driver? It's metadata used for individual pointer variable referencing,
not a data frag or in/out buffer. It has no need for contiguous physical
memory.


Thanks.



NACK.

This would cause a degradation when iterating the entries of table->icm.
For example, in mlx4_table_get_range.

E.g.
int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
			 u32 start, u32 end)
{
	int inc = MLX4_TABLE_CHUNK_SIZE / table->obj_size;
	int err;
	u32 i;

	for (i = start; i <= end; i += inc) {
		err = mlx4_table_get(dev, table, i);
		if (err)
			goto fail;
	}

	return 0;
...
}

E.g. an mtt object is 8 bytes, so a 4KB ICM block holds 512 mtt objects.
You have to allocate another 512 mtt objects for the table->icm pointer
to advance by one entry, and since a 4KB page of the pointer array itself
holds 512 such pointers, 512 x 512 = 256K mtt objects are needed before
traversing table->icm crosses a page boundary in these call stacks.

Considering mlx4_table_get_range() is only used in the control path,
there is no significant gain from kvzalloc vs. vzalloc for table->icm.

Anyway, if a user makes sure the mlx4 driver is loaded very early and
doesn't remove and reload it afterwards, we should have enough (and not
be wasting) contiguous physical memory for the table->icm allocation.
I will use kvzalloc to replace vzalloc and send a v4 patch.

Thanks,
Qing




Thanks,
Tariq


And you'll understand some people care about this.

Strongly.

Thanks.








Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-17 Thread Qing Huang



On 5/17/2018 2:14 PM, Eric Dumazet wrote:

On 05/17/2018 01:53 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...


NACK on this patch.

You have been asked repeatedly to use kvmalloc()

This is not a minor suggestion.

Take a look at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d


Would you please take a look at how table->icm is being used in the mlx4
driver? It's metadata used for individual pointer variable referencing,
not a data frag or in/out buffer. It has no need for contiguous physical
memory.


Thanks.


And you'll understand some people care about this.

Strongly.

Thanks.





[PATCH v3] mlx4_core: allocate ICM memory in page size chunks

2018-05-17 Thread Qing Huang
When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
Reviewed-by: Zhu Yanjun 
---
v3: use PAGE_SIZE instead of PAGE_SHIFT
add comma to the end of enum variables
include vmalloc.h header file to avoid build issues on Sparc

v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..3705f4e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -33,6 +33,7 @@

 #include 
 #include 
+#include <linux/vmalloc.h>
 #include 
 #include 

@@ -43,12 +44,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +401,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm	= kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm	= vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +447,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 		mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +463,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 		mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);
 }
-- 
2.9.3



Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-15 Thread Qing Huang



On 5/15/2018 12:08 PM, Eric Dumazet wrote:


On 05/15/2018 11:53 AM, Qing Huang wrote:

This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here; please call kvzalloc so we
maintain a similar behavior when contiguous memory is available, and a
fallback for resiliency.

Not sure what degradation exactly is caused by vzalloc here. I think it's
better to leave physically contiguous pages to other requests which really
need them. Besides, the slow path / memory compaction can be really expensive.


Just use kvzalloc(), and you get the benefit of having contiguous memory
if available, without the expensive compaction phase.

This thing _automatically_ falls back to vmalloc(), thus your problem will
be solved.

If you are not sure, trust others.


Thanks for the review. There are many places in the kernel and in
applications where physically contiguous pages are needed. We saw quite a
few issues when there was not enough contiguous physical memory available.
My main concern here is: why use physically contiguous pages when they are
not really needed?






Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-15 Thread Qing Huang



On 5/15/2018 2:19 AM, Tariq Toukan wrote:



On 14/05/2018 7:41 PM, Qing Huang wrote:



On 5/13/2018 2:00 AM, Tariq Toukan wrote:



On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
Reviewed-by: Zhu Yanjun 
---
v2 -> v1: adjusted chunk size to reflect different architectures.

  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c

index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT


Which is actually PAGE_SIZE.


Yes, we wanted to avoid high order memory allocations.



Then please use PAGE_SIZE instead.


PAGE_SIZE is usually defined as 1 << PAGE_SHIFT, so I think PAGE_SHIFT is actually more appropriate here.






Also, please add a comma at the end of the last entry.

Hmm..., I followed the existing code style, and checkpatch.pl didn't
complain about the comma.

I am in favor of having a comma also after the last element, so that when
another enum element is added we do not modify this line again, which
would falsely affect git blame.

I know it didn't exist before your patch, but once we're here, let's do it.

I'm okay either way. If adding an extra comma is preferred by many people,
someone should update checkpatch.pl to enforce it. :)







  };

  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,

  obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
  num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-    table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+    table->icm  = vzalloc(num_icm * sizeof(*table->icm));


Why not kvzalloc?


I think the table->icm pointer array doesn't really need physically
contiguous memory. Sometimes high-order memory allocation by kmalloc
variants may trigger the slow path and cause tasks to be blocked.




This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here; please call kvzalloc so we
maintain a similar behavior when contiguous memory is available, and a
fallback for resiliency.

Not sure what degradation is caused by vzalloc here. I think it's better
to leave physically contiguous pages to other requests which really need
them. Besides, the slow path / memory compaction can be really expensive.





Thanks,
Qing




  if (!table->icm)
      return -ENOMEM;
  table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,

      mlx4_free_icm(dev, table->icm[i], use_coherent);
  }

-    kfree(table->icm);
+    vfree(table->icm);

  return -ENOMEM;
  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)

      mlx4_free_icm(dev, table->icm[i], table->coherent);
  }

-    kfree(table->icm);
+    vfree(table->icm);
  }



Thanks for your patch.

I need to verify there is no dramatic performance degradation here.
You can prepare and send a v3 in the meanwhile.

Thanks,
Tariq

Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-14 Thread Qing Huang



On 5/13/2018 2:00 AM, Tariq Toukan wrote:



On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
Reviewed-by: Zhu Yanjun 
---
v2 -> v1: adjusted chunk size to reflect different architectures.

  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT


Which is actually PAGE_SIZE.


Yes, we wanted to avoid high order memory allocations.


Also, please add a comma at the end of the last entry.


Hmm..., I followed the existing code style, and checkpatch.pl didn't complain about the comma.





  };
  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,

   obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
   num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-    table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+    table->icm  = vzalloc(num_icm * sizeof(*table->icm));


Why not kvzalloc?


I think the table->icm pointer array doesn't really need physically
contiguous memory. Sometimes high-order memory allocation by kmalloc
variants may trigger the slow path and cause tasks to be blocked.


Thanks,
Qing




  if (!table->icm)
  return -ENOMEM;
  table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,

  mlx4_free_icm(dev, table->icm[i], use_coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
    return -ENOMEM;
  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)

  mlx4_free_icm(dev, table->icm[i], table->coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
  }



Thanks for your patch.

I need to verify there is no dramatic performance degradation here.
You can prepare and send a v3 in the meanwhile.

Thanks,
Tariq




[PATCH V2] mlx4_core: allocate ICM memory in page size chunks

2018-05-11 Thread Qing Huang
When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
Reviewed-by: Zhu Yanjun 
---
v2 -> v1: adjusted chunk size to reflect different architectures.

 drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= 1 << PAGE_SHIFT,
+	MLX4_TABLE_CHUNK_SIZE	= 1 << PAGE_SHIFT
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm	= kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm	= vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 		mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 		mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);
 }
-- 
2.9.3



Re: [PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-11 Thread Qing Huang


On 5/11/2018 3:27 AM, Håkon Bugge wrote:

On 11 May 2018, at 01:31, Qing Huang  wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size, the above issue is fixed.

However in order to support 4KB ICM chunk size, we need to fix another
issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang
Acked-by: Daniel Jurgens
---
drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
#include "fw.h"

/*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
  */
enum {
-   MLX4_ICM_ALLOC_SIZE = 1 << 18,
-   MLX4_TABLE_CHUNK_SIZE   = 1 << 18
+   MLX4_ICM_ALLOC_SIZE = 1 << 12,
+   MLX4_TABLE_CHUNK_SIZE   = 1 << 12

Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the hw page size is 8KiB.
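
[For reference: PAGE_SIZE is normally defined as 1UL << PAGE_SHIFT, i.e. 1 << 12 = 4KiB on x86_64 and 1 << 13 = 8KiB on SPARC; the v2 patch addresses this by switching the constants to 1 << PAGE_SHIFT.]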


Good point on supporting a wider range of architectures. I got tunnel
vision when fixing this on our x64 lab machines.

Will send a v2 patch.

Thanks,
Qing


Thxs, Håkon


};

static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-   table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+   table->icm  = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}

-   kfree(table->icm);
+   vfree(table->icm);

return -ENOMEM;
}
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}

-   kfree(table->icm);
+   vfree(table->icm);
}
--
2.9.3





Re: [PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-10 Thread Qing Huang

Thank you for reviewing it!


On 5/10/2018 6:23 PM, Yanjun Zhu wrote:




On 2018/5/11 9:15, Qing Huang wrote:




On 5/10/2018 5:13 PM, Yanjun Zhu wrote:



On 2018/5/11 7:31, Qing Huang wrote:

When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size, the above issue is fixed.

However in order to support 4KB ICM chunk size, we need to fix another
issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need

Hi,

Replace contiguous memory pages with virtual memory: is there any performance loss?


Not really. "table->icm" will be accessed as individual pointer variables
randomly. Kcalloc

Sure. Thanks. If "table->icm" will be accessed as individual pointer
variables randomly, the performance loss caused by discontinuous memory
will be very trivial.

Reviewed-by: Zhu Yanjun 

also returns a virtual address, except its mapped pages are guaranteed to
be contiguous, which provides little advantage over vzalloc for individual
pointer-variable access.


Qing



Zhu Yanjun

of DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
---
  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c

index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << 12,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << 12
  };
  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,

  obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
  num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-    table->icm  = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+    table->icm  = vzalloc(num_icm * sizeof(*table->icm));
  if (!table->icm)
  return -ENOMEM;
  table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,

  mlx4_free_icm(dev, table->icm[i], use_coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
    return -ENOMEM;
  }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)

  mlx4_free_icm(dev, table->icm[i], table->coherent);
  }
  -    kfree(table->icm);
+    vfree(table->icm);
  }










[PATCH] mlx4_core: allocate 4KB ICM chunks

2018-05-10 Thread Qing Huang
When a system is under memory pressure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size, the above issue is fixed.

However in order to support 4KB ICM chunk size, we need to fix another
issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang 
Acked-by: Daniel Jurgens 
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"

 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= 1 << 12,
+	MLX4_TABLE_CHUNK_SIZE	= 1 << 12
 };

 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

-	table->icm	= kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm	= vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 		mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);

 	return -ENOMEM;
 }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 		mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}

-	kfree(table->icm);
+	vfree(table->icm);
 }
-- 
2.9.3



Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands

2018-04-26 Thread Qing Huang
Thanks! Looks like Eran's patch will be available much sooner than the
v4.18 release time frame (since v4.16 was just released).

We will wait. :-)


On 04/26/2018 04:43 PM, Saeed Mahameed wrote:

Before you address my comments, it looks like Eran's work is
converging and we will finalize the internal
review next week, in which case submission will take 2-3 weeks from today.

So i suggest to wait.

please checkout:
https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=testing/mlx5e-stats&id=15ffb7f87432d073e8ac0e6b895188d40fdda4d4




Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands

2018-04-26 Thread Qing Huang



On 04/26/2018 02:50 PM, Saeed Mahameed wrote:

On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang  wrote:

Current stats collecting scheme in mlx5 driver is to periodically fetch
aggregated stats from all the active mlx5 software channels associated
with the device. However when a mlx5 interface is brought down(ifdown),
all the channels will be deactivated and closed. A new set of channels
will be created when next ifup command or a similar command is called.
Unfortunately the new channels will have all stats reset to 0. So you
lose the accumulated stats information. This behavior is different from
other netdev drivers including the mlx4 driver. In order to fix it, we
now save prior mlx5 software stats into netdev stats fields, so all the
accumulated stats will survive multiple runs of ifdown/ifup commands and
be shown correctly.

Orabug: 27548610

Signed-off-by: Qing Huang 
---

Hi Qing,

I am adding Eran since he is currently working on a similar patch,
He is also taking care of all cores/rings stats to make them
persistent, so you won't have discrepancy  between
ethtool and ifconfig stats.

I am ok with this patch, but this means Eran has to work his way around it.

so we have two options:

1. Temporary accept this patch, and change it later with Eran's work.
2. Wait for Eran's work.

I am ok with either one of them, please let me know.

Thanks !

Hi Saeed,

Any idea on a rough ETA for Eran's stats work to land upstream? If it will
be available soon, I think we can wait a bit. If it will take a while to
redesign the whole stats scheme (for both ethtool and ifconfig), maybe we
can go with this incremental fix first?

Thanks!




  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++++++++++++++++++++++++-----
  1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f1fe490..5d50e69 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
 netdev_set_tc_queue(netdev, tc, nch, 0);
  }

+static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
+{
+   struct net_device *netdev = priv->netdev;
+
+   netdev->stats.rx_packets += priv->stats.sw.rx_packets;
+   netdev->stats.rx_bytes   += priv->stats.sw.rx_bytes;
+   netdev->stats.tx_packets += priv->stats.sw.tx_packets;
+   netdev->stats.tx_bytes   += priv->stats.sw.tx_bytes;
+   netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
+
+   priv->stats.sw.rx_packets   = 0;
+   priv->stats.sw.rx_bytes = 0;
+   priv->stats.sw.tx_packets   = 0;
+   priv->stats.sw.tx_bytes = 0;
+   priv->stats.sw.tx_queue_dropped = 0;
+}
+

This means that we are now explicitly clearing channel stats on ifconfig
down or switch_channels, and after ifconfig down, ethtool will always show
0; before this patch it didn't. Anyway, the update-sw-stats function will
always override them with the new channels' stats the next time we load
new channels, so it is not that big of a deal.



  static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
  {
 struct mlx5e_channel *c;
@@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
 netif_set_real_num_tx_queues(netdev, new_num_txqs);

 mlx5e_deactivate_priv_channels(priv);
+   mlx5e_netdev_save_stats(priv);
 mlx5e_close_channels(&priv->channels);

 priv->channels = *new_chs;
@@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)

 netif_carrier_off(priv->netdev);
 mlx5e_deactivate_priv_channels(priv);
+   mlx5e_netdev_save_stats(priv);
 mlx5e_close_channels(&priv->channels);

 return 0;
@@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 	stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
 	stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
 } else {
-   stats->rx_packets = sstats->rx_packets;
-   stats->rx_bytes   = sstats->rx_bytes;
-   stats->tx_packets = sstats->tx_packets;
-   stats->tx_bytes   = sstats->tx_bytes;
-   stats->tx_dropped = sstats->tx_queue_dropped;
+   stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
+   stats->rx_bytes   = sstats->rx_bytes + dev->stats.rx_bytes;
+   stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
+   stats->tx_bytes   = sstats->tx_bytes + dev->stats.tx_bytes;
+   stats->tx_dropped = sstats->tx_queue_dropped +
+   dev->stats.tx_dropped;
 }

 stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
--
1.8.3.1





[PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands

2018-04-26 Thread Qing Huang
Current stats collecting scheme in mlx5 driver is to periodically fetch
aggregated stats from all the active mlx5 software channels associated
with the device. However, when a mlx5 interface is brought down (ifdown),
all the channels will be deactivated and closed. A new set of channels
will be created when next ifup command or a similar command is called.
Unfortunately the new channels will have all stats reset to 0. So you
lose the accumulated stats information. This behavior is different from
other netdev drivers including the mlx4 driver. In order to fix it, we
now save prior mlx5 software stats into netdev stats fields, so all the
accumulated stats will survive multiple runs of ifdown/ifup commands and
be shown correctly.

Orabug: 27548610

Signed-off-by: Qing Huang 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f1fe490..5d50e69 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
netdev_set_tc_queue(netdev, tc, nch, 0);
 }
 
+static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
+{
+   struct net_device *netdev = priv->netdev;
+
+   netdev->stats.rx_packets += priv->stats.sw.rx_packets;
+   netdev->stats.rx_bytes   += priv->stats.sw.rx_bytes;
+   netdev->stats.tx_packets += priv->stats.sw.tx_packets;
+   netdev->stats.tx_bytes   += priv->stats.sw.tx_bytes;
+   netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
+
+   priv->stats.sw.rx_packets   = 0;
+   priv->stats.sw.rx_bytes = 0;
+   priv->stats.sw.tx_packets   = 0;
+   priv->stats.sw.tx_bytes = 0;
+   priv->stats.sw.tx_queue_dropped = 0;
+}
+
 static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
 {
struct mlx5e_channel *c;
@@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
netif_set_real_num_tx_queues(netdev, new_num_txqs);
 
mlx5e_deactivate_priv_channels(priv);
+   mlx5e_netdev_save_stats(priv);
mlx5e_close_channels(&priv->channels);
 
priv->channels = *new_chs;
@@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
 
netif_carrier_off(priv->netdev);
mlx5e_deactivate_priv_channels(priv);
+   mlx5e_netdev_save_stats(priv);
mlx5e_close_channels(&priv->channels);
 
return 0;
@@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
	stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
	stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
} else {
-   stats->rx_packets = sstats->rx_packets;
-   stats->rx_bytes   = sstats->rx_bytes;
-   stats->tx_packets = sstats->tx_packets;
-   stats->tx_bytes   = sstats->tx_bytes;
-   stats->tx_dropped = sstats->tx_queue_dropped;
+   stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
+   stats->rx_bytes   = sstats->rx_bytes + dev->stats.rx_bytes;
+   stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
+   stats->tx_bytes   = sstats->tx_bytes + dev->stats.tx_bytes;
+   stats->tx_dropped = sstats->tx_queue_dropped +
+   dev->stats.tx_dropped;
}
 
stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
-- 
1.8.3.1



Re: Setting large MTU size on slave interfaces may stall the whole system

2017-12-14 Thread Qing Huang

Hi Or,


On 12/13/2017 06:28 AM, Or Gerlitz wrote:

On Tue, Dec 12, 2017 at 5:21 AM, Qing Huang  wrote:

Hi,

We found an issue with the bonding driver when testing Mellanox devices.
The following test commands will stall the whole system sometimes, with
the serial console flooded with log messages from the bond_miimon_inspect()
function. Setting the mtu size to 1500 seems okay, but very rarely it may
hit the same problem too.

ip address flush dev ens3f0
ip link set dev ens3f0 down
ip address flush dev ens3f1
ip link set dev ens3f1 down
[root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1
updelay=500 downdelay=500
[root@ca-hcl629 etc]# ifconfig bond0 up
[root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
[root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up
Serial console output:

** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms

[..]



It seems that when setting a large mtu size on an RoCE interface, the RTNL
mutex may be held too long by the slave interface, causing bond_mii_monitor()
to be called repeatedly at an interval of 1 tick (with a 1K HZ kernel
configuration) and the kernel to become unresponsive.

Did you try/manage to reproduce that also with other NIC drivers?

This seems to be an issue with the bonding driver. Also, running older
kernels on the same hardware without commit de77ecd4ef02 seems to work
fine. We can try to reproduce the issue with other types of NIC hardware.



Or.




Setting large MTU size on slave interfaces may stall the whole system

2017-12-11 Thread Qing Huang

(resend this email in text format)


Hi,

We found an issue with the bonding driver when testing Mellanox devices.
The following test commands will stall the whole system sometimes, with
the serial console flooded with log messages from the bond_miimon_inspect()
function. Setting the mtu size to 1500 seems okay, but very rarely it may
hit the same problem too.

ip address flush dev ens3f0
ip link set dev ens3f0 down
ip address flush dev ens3f1
ip link set dev ens3f1 down
[root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1
updelay=500 downdelay=500
[root@ca-hcl629 etc]# ifconfig bond0 up
[root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
[root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up


Serial console output:

** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 5 printk messages dropped ** [ 3717.755737] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 5 printk messages dropped ** [ 3717.767758] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 4 printk messages dropped ** [ 3717.37] bond0: link status down for interface ens3f0, disabling it in 500 ms

or

** 4 printk messages dropped ** [274743.297863] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.307866] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.317857] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.327823] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.337817] bond0: link status down again after 500 ms for interface enp48s0f1


The root cause is the combined effect of commit
1f2cd845d3827412e82bf26dde0abca332ede402 (Revert "Merge branch
'bonding_monitor_locking'") and commit de77ecd4ef02ca783f7762e04e92b3d0964be66b
("bonding: improve link-status update in mii-monitoring"). E.g., reverting
the second commit, we don't see the problem.

It seems that when setting a large mtu size on an RoCE interface, the RTNL
mutex may be held too long by the slave interface, causing bond_mii_monitor()
to be called repeatedly at an interval of 1 tick (with a 1K HZ kernel
configuration) and the kernel to become unresponsive.


We found two possible solutions:

#1, don't re-arm the mii monitor thread too quickly if we cannot get the RTNL lock:

index b2db581..8fd587a 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2266,7 +2266,6 @@ static void bond_mii_monitor(struct work_struct *work)


    /* Race avoidance with bond_close cancel of workqueue */
    if (!rtnl_trylock()) {
-   delay = 1;
    should_notify_peers = false;
    goto re_arm;
    }

#2, use printk_ratelimit() to avoid flooding log messages generated by bond_miimon_inspect():


index b2db581..0183b7f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2054,7 +2054,7 @@ static int bond_miimon_inspect(struct bonding *bond)
 			bond_propose_link_state(slave, BOND_LINK_FAIL);
 			commit++;
 			slave->delay = bond->params.downdelay;
-			if (slave->delay) {
+			if (slave->delay && printk_ratelimit()) {
 				netdev_info(bond->dev, "link status down for %sinterface %s, disabling it in %d ms\n",
 					    (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) ?
@@ -2105,7 +2105,8 @@ static int bond_miimon_inspect(struct bonding *bond)
 		case BOND_LINK_BACK:
 			if (!link_state) {
 				bond_propose_link_state(slave, BOND_LINK_DOWN);
-				netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
+				if (printk_ratelimit())
+					netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
 					    (bond->params.updelay - slave->delay) * bond->params.miimon,
 					    slave->dev->name);


Regarding the flooding messages, the netdev_info output is misleading
anyway when bond_mii_monitor() is called at a 1-tick interval due to lock
contention.



Solution #1 looks simpler and cleaner to me. Any side effect of doing that?


Thanks,
Qing