Re: [PATCH v2 net] mlx4_core: restore optimal ICM memory allocation
On 5/31/2018 5:52 AM, Eric Dumazet wrote:
> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> brought two regressions caught in our regression suite.
>
> The big one is an additional cost of 256 bytes of overhead per 4096 bytes,
> or 6.25 %, which is unacceptable since ICM can be pretty large.
>
> This comes from having to allocate one struct mlx4_icm_chunk (256 bytes)
> per MLX4_TABLE_CHUNK, which the buggy commit shrank to 4KB
> (instead of prior 256KB)

It would be great if you could share the test case that triggered the KASAN
report in your environment. Our QA has been running intensive tests using 8KB
or 4KB chunk size configuration for some time; no one has reported memory
corruption issues so far, given that we are not running the latest upstream
kernel. IMO, it's worthwhile to find out the root cause and fix the problem.

> Note that mlx4_alloc_icm() is already able to try high order allocations
> and fallback to low-order allocations under high memory pressure.
>
> Most of these allocations happen right after boot time, when we get
> plenty of non-fragmented memory; there is really no point being so
> pessimistic and breaking huge pages into order-0 ones just for fun.
>
> We only have to tweak gfp_mask a bit, to help falling back faster,
> without risking OOM killings.

Just FYI, out of memory wasn't our original concern. We didn't encounter OOM
killings.

> Second regression is a KASAN fault, that will need further investigations.
Fixes: 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
Signed-off-by: Eric Dumazet
Acked-by: Tariq Toukan
Cc: John Sperbeck
Cc: Tarick Bedeir
Cc: Qing Huang
Cc: Daniel Jurgens
Cc: Zhu Yanjun
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 685337d58276fc91baeeb64387c52985e1bc6dda..5342bd8a3d0bfaa9e76bb9b6943790606c97b181 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,13 @@
 #include "fw.h"
 
 /*
- * We allocate in page size (default 4KB on many archs) chunks to avoid high
- * order memory allocations in fragmented/high usage memory situation.
+ * We allocate in as big chunks as we can, up to a maximum of 256 KB
+ * per chunk. Note that the chunks are not necessarily in contiguous
+ * physical memory.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
-	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
+	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
+	MLX4_TABLE_CHUNK_SIZE	= 1 << 18,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -135,6 +136,7 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
 	struct mlx4_icm *icm;
 	struct mlx4_icm_chunk *chunk = NULL;
 	int cur_order;
+	gfp_t mask;
 	int ret;
 
 	/* We use sg_set_buf for coherent allocs, which assumes low memory */
@@ -178,13 +180,17 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
 			while (1 << cur_order > npages)
 				--cur_order;
 
+			mask = gfp_mask;
+			if (cur_order)
+				mask &= ~__GFP_DIRECT_RECLAIM;
+
 			if (coherent)
 				ret = mlx4_alloc_icm_coherent(&dev->persist->pdev->dev,
 						&chunk->mem[chunk->npages],
-						cur_order, gfp_mask);
+						cur_order, mask);
 			else
 				ret = mlx4_alloc_icm_pages(&chunk->mem[chunk->npages],
-						cur_order, gfp_mask,
+						cur_order, mask,
 						dev->numa_node);
 
 			if (ret) {
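[Editor's note] The order-fallback behavior this patch relies on can be sketched outside the kernel. The following is a minimal Python sketch, not kernel code: `alloc(order, allow_reclaim)` is a hypothetical stand-in for the page allocator, and `MAX_ORDER` is simply the 256KB/4KB ratio. Clearing "direct reclaim" for orders above 0 (mirroring the `mask &= ~__GFP_DIRECT_RECLAIM` in the diff) is what makes a failed high-order attempt cheap, so the loop falls back quickly instead of forcing compaction.

```python
MAX_ORDER = 6  # 256KB chunk / 4KB pages -> at most order 6 (illustrative)

def alloc_icm_chunk(npages, alloc):
    """Mimic mlx4_alloc_icm()'s order fallback (sketch only).

    `alloc(order, allow_reclaim)` stands in for the page allocator and
    returns True on success. For order > 0 we disallow direct reclaim,
    so failure is fast; only the final order-0 attempt may reclaim.
    """
    cur_order = MAX_ORDER
    while (1 << cur_order) > npages:   # never allocate more than needed
        cur_order -= 1
    while cur_order >= 0:
        if alloc(cur_order, allow_reclaim=(cur_order == 0)):
            return cur_order           # order actually obtained
        cur_order -= 1                 # fall back to a smaller order
    return None                        # total failure -> -ENOMEM

# Example: a fragmented system where only order <= 2 succeeds.
fragmented = lambda order, allow_reclaim: order <= 2
```

With `fragmented`, a 64-page request degrades gracefully from order 6 down to an order-2 success without ever entering the reclaim slow path.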
Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation
On 5/30/2018 2:30 PM, Eric Dumazet wrote:
> On Wed, May 30, 2018 at 5:08 PM Qing Huang wrote:
>> On 5/30/2018 1:50 PM, Eric Dumazet wrote:
>>> On Wed, May 30, 2018 at 4:30 PM Qing Huang wrote:
>>>> On 5/29/2018 9:11 PM, Eric Dumazet wrote:
>>>>> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size
>>>>> chunks") brought a regression caught in our regression suite,
>>>>> thanks to KASAN.
>>>>
>>>> If KASAN reported issue was really caused by smaller chunk sizes,
>>>> changing allocation order dynamically will eventually hit the same
>>>> issue.
>>>
>>> Sigh, you have little idea of what your patch really did...
>>>
>>> The KASAN part only shows the tip of the iceberg, but our main
>>> concern is an increase of memory overhead.
>>
>> Well, the commit log only mentioned KASAN, but the change here didn't
>> seem to solve the issue.
>
> Can you elaborate? My patch solves our problems. Both the memory
> overhead and KASAN splats are gone.

If the KASAN issue was triggered by using smaller chunks, then under memory
pressure with lots of fragments, low-order memory allocation will do similar
things. So perhaps in your test env memory allocation and usage is relatively
static; that's probably why using larger chunks didn't really exercise the
low-order allocation code path, hence no KASAN issue was spotted.

Smaller chunk size in the mlx4 driver is not supposed to cause any memory
corruption. We will probably need to continue to investigate this. Can you
provide the test command that triggered this issue when running a KASAN
kernel, so we can try to reproduce it in our lab? It could be that upstream
code is missing some other fixes.

> Alternative is to revert your patch, since we are now very late in
> 4.17 cycle.
>
> Memory usage has grown a lot with your patch, since each 4KB page
> needs a full struct mlx4_icm_chunk (256 bytes of overhead !)

>> Going to smaller chunks will have some overhead. It depends on the
>> application though. What's the total increased memory consumption in
>> your env?
>
> As I explained, your patch adds 256 bytes of overhead per 4KB.
> Your changelog did not mention that at all, and we discovered this
> the hard way.

If you have some concern regarding memory usage, you should bring this up
during code review. Repeated failure and retry for lower order allocations
could be bad for latency too. This wasn't mentioned in this commit either.

Like I said, how much overhead really matters depends on the application.
256 bytes x chunks may not be significant on a server with lots of memory.

> That is pretty intolerable, and is a blocker for us, memory is precious.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
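[Editor's note] The overhead figures being argued over in this exchange are simple to check. A quick sketch, using only the sizes stated in the thread (one 256-byte struct mlx4_icm_chunk of bookkeeping per table chunk):

```python
CHUNK_STRUCT_BYTES = 256  # sizeof(struct mlx4_icm_chunk), per the thread

def icm_overhead(chunk_size_bytes):
    """Bookkeeping overhead as a fraction of ICM memory covered."""
    return CHUNK_STRUCT_BYTES / chunk_size_bytes

overhead_4k = icm_overhead(4096)          # one struct per 4KB chunk
overhead_256k = icm_overhead(256 * 1024)  # one struct per 256KB chunk

print(f"4KB chunks:   {overhead_4k:.2%}")    # 6.25% -> Eric's figure
print(f"256KB chunks: {overhead_256k:.4%}")  # ~0.1%
```

So shrinking MLX4_TABLE_CHUNK_SIZE from 256KB to 4KB multiplies the per-chunk bookkeeping overhead by 64, which is where the "6.25 % which is unacceptable" claim comes from.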
Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation
On 5/30/2018 1:50 PM, Eric Dumazet wrote:
> On Wed, May 30, 2018 at 4:30 PM Qing Huang wrote:
>> On 5/29/2018 9:11 PM, Eric Dumazet wrote:
>>> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size
>>> chunks") brought a regression caught in our regression suite, thanks
>>> to KASAN.
>>
>> If KASAN reported issue was really caused by smaller chunk sizes,
>> changing allocation order dynamically will eventually hit the same
>> issue.
>
> Sigh, you have little idea of what your patch really did...
>
> The KASAN part only shows the tip of the iceberg, but our main concern
> is an increase of memory overhead.

Well, the commit log only mentioned KASAN, but the change here didn't seem
to solve the issue.

> Alternative is to revert your patch, since we are now very late in 4.17
> cycle.
>
> Memory usage has grown a lot with your patch, since each 4KB page needs
> a full struct mlx4_icm_chunk (256 bytes of overhead !)

Going to smaller chunks will have some overhead. It depends on the
application though. What's the total increased memory consumption in your
env?

> Really we have no choice here, your patch went too far and increased
> memory consumption quite a lot.
>
> My patch is simply the best way to address your original concern, and
> not increase overall overhead.
>
> (each struct mlx4_icm_chunk should be able to store MLX4_ICM_CHUNK_LEN
> pages, instead of one page of 4KB)
Re: [PATCH net] mlx4_core: restore optimal ICM memory allocation
On 5/29/2018 9:11 PM, Eric Dumazet wrote:
> Commit 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> brought a regression caught in our regression suite, thanks to KASAN.

If KASAN reported issue was really caused by smaller chunk sizes, changing
allocation order dynamically will eventually hit the same issue.

> Note that mlx4_alloc_icm() is already able to try high order allocations
> and fallback to low-order allocations under high memory pressure.
>
> We only have to tweak gfp_mask a bit, to help falling back faster,
> without risking OOM killings.
>
> BUG: KASAN: slab-out-of-bounds in to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
> Read of size 4 at addr 8817df584f68 by task qp_listing_test/92585
>
> CPU: 38 PID: 92585 Comm: qp_listing_test Tainted: G O
> Call Trace:
>  [] dump_stack+0x4d/0x72
>  [] print_address_description+0x6f/0x260
>  [] kasan_report+0x257/0x370
>  [] __asan_report_load4_noabort+0x19/0x20
>  [] to_rdma_ah_attr+0x808/0x9e0 [mlx4_ib]
>  [] mlx4_ib_query_qp+0x1213/0x1660 [mlx4_ib]
>  [] qpstat_print_qp+0x13b/0x500 [ib_uverbs]
>  [] qpstat_seq_show+0x4a/0xb0 [ib_uverbs]
>  [] seq_read+0xa9c/0x1230
>  [] proc_reg_read+0xc1/0x180
>  [] __vfs_read+0xe8/0x730
>  [] vfs_read+0xf7/0x300
>  [] SyS_read+0xd2/0x1b0
>  [] do_syscall_64+0x186/0x420
>  [] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x7f851a7bb30d
> RSP: 002b:7ffd09a758c0 EFLAGS: 0293 ORIG_RAX:
> RAX: ffda RBX: 7f84ff959440 RCX: 7f851a7bb30d
> RDX: 0003fc00 RSI: 7f84ff60a000 RDI: 000b
> RBP: 7ffd09a75900 R08: R09: R10: 0022
> R11: 0293 R12: R13: 0003
> R14: 0003 R15: 7f84ff60a000
>
> Allocated by task 4488:
>  save_stack+0x46/0xd0
>  kasan_kmalloc+0xad/0xe0
>  __kmalloc+0x101/0x5e0
>  ib_register_device+0xc03/0x1250 [ib_core]
>  mlx4_ib_add+0x27d6/0x4dd0 [mlx4_ib]
>  mlx4_add_device+0xa9/0x340 [mlx4_core]
>  mlx4_register_interface+0x16e/0x390 [mlx4_core]
>  xhci_pci_remove+0x7a/0x180 [xhci_pci]
>  do_one_initcall+0xa0/0x230
>  do_init_module+0x1b9/0x5a4
>  load_module+0x63e6/0x94c0
>  SYSC_init_module+0x1a4/0x1c0
>  SyS_init_module+0xe/0x10
>  do_syscall_64+0x186/0x420
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>
> Freed by task 0:
> (stack is not available)
>
> The buggy address belongs to the object at 8817df584f40
>  which belongs to the cache kmalloc-32 of size 32
> The buggy address is located 8 bytes to the right of
>  32-byte region [8817df584f40, 8817df584f60)
> The buggy address belongs to the page:
> page:ea005f7d6100 count:1 mapcount:0 mapping:8817df584000 index:0x8817df584fc1
> flags: 0x8800100(slab)
> raw: 08800100 8817df584000 8817df584fc1 0001003f
> raw: ea005f3ac0a0 ea005c476760 8817fec00900 883ff78d26c0
> page dumped because: kasan: bad access detected
> page->mem_cgroup:883ff78d26c0
>
> Memory state around the buggy address:
>  8817df584e00: 00 03 fc fc fc fc fc fc 00 03 fc fc fc fc fc fc
>  8817df584e80: 00 00 00 04 fc fc fc fc 00 00 00 fc fc fc fc fc
>  8817df584f00: fb fb fb fb fc fc fc fc 00 00 00 00 fc fc fc fc
>                                        ^
>  8817df584f80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>  8817df585000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
> Fixes: 1383cb8103bb ("mlx4_core: allocate ICM memory in page size chunks")
> Signed-off-by: Eric Dumazet
> Cc: John Sperbeck
> Cc: Tarick Bedeir
> Cc: Qing Huang
> Cc: Daniel Jurgens
> Cc: Zhu Yanjun
> Cc: Tariq Toukan
> ---
>  drivers/net/ethernet/mellanox/mlx4/icm.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 685337d58276fc91baeeb64387c52985e1bc6dda..cae33d5c7dbd9ba7929adcf2127b104f6796fa5a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -43,12 +43,13 @@
>  #include "fw.h"
>  
>  /*
> - * We allocate in page size (default 4KB on many archs) chunks to avoid high
> - * order memory allocations in fragmented/high usage memory situation.
> + * We allocate in as big chunks as we can, up to a maximum of 256 KB
> + * per chunk. Note that the chunks are not necessarily in contiguous
> + * physical memory.
>   */
>  enum {
> -	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
> -	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
> +	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
> +	MLX4_TABLE_CHUNK_SIZE	= 1 << 18,
>  };
>  
>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
> @@ -135,6 +136,7 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
>  	struct mlx4_icm *icm;
>  	struct mlx4_icm_chunk *chunk = NULL;
>  	int cur_order;
> +	gfp_t mask;
>  	int ret;
>  
>  	/* We use sg_set_buf for coherent allocs, which assumes low memory */
> @@ -178,13 +180
[PATCH V4] mlx4_core: allocate ICM memory in page size chunks
When a system is under memory pressure (high usage with fragments), the
original 256KB ICM chunk allocations will likely trigger kernel memory
management to enter the slow path, doing memory compact/migration ops in
order to complete high order memory allocations. When that happens, user
processes calling uverb APIs may get stuck for more than 120s easily even
though there are a lot of free pages in smaller chunks available in the
system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However
in order to support smaller ICM chunk size, we need to fix another issue
in large size kcalloc allocations.

E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM
chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each
mtt entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers, which can easily cause kcalloc to fail.

The solution is to use kvzalloc to replace kcalloc, which will fall back
to vmalloc automatically if kmalloc fails.

Signed-off-by: Qing Huang
Acked-by: Daniel Jurgens
Reviewed-by: Zhu Yanjun
---
v4: use kvzalloc instead of vzalloc
    add one err condition check
    don't include vmalloc.h any more
v3: use PAGE_SIZE instead of PAGE_SHIFT
    add comma to the end of enum variables
    include vmalloc.h header file to avoid build issues on Sparc
v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 16 ++++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..685337d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -398,9 +398,11 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	u64 size;
 
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
+	if (WARN_ON(!obj_per_chunk))
+		return -EINVAL;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +448,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}
 
-	kfree(table->icm);
+	kvfree(table->icm);
 
 	return -ENOMEM;
 }
@@ -462,5 +464,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}
 
-	kfree(table->icm);
+	kvfree(table->icm);
 }
--
2.9.3
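[Editor's note] The 16MB figure in the changelog above follows directly from the numbers it gives (8-byte mtt entries, 4KB chunks, 8-byte pointers on x86_64). A quick arithmetic check:

```python
mtt_entries    = 1 << 30   # log_num_mtt = 30 -> 1G mtt entries
mtt_entry_size = 8         # bytes per mtt entry
chunk_size     = 4096      # 4KB ICM chunk (MLX4_TABLE_CHUNK_SIZE)
pointer_size   = 8         # bytes per table->icm[] entry on x86_64

entries_per_chunk = chunk_size // mtt_entry_size      # 512 entries/chunk
num_icm           = mtt_entries // entries_per_chunk  # 2M chunk pointers
array_bytes       = num_icm * pointer_size            # size of table->icm

print(entries_per_chunk, num_icm, array_bytes)  # 512 2097152 16777216
```

A 16MB (order-12) kmalloc is well beyond what the slab allocator can reliably satisfy, which is why the pointer array itself must go through kvzalloc.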
Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks
On 5/22/2018 8:33 AM, Tariq Toukan wrote:
> On 18/05/2018 12:45 AM, Qing Huang wrote:
>> On 5/17/2018 2:14 PM, Eric Dumazet wrote:
>>> On 05/17/2018 01:53 PM, Qing Huang wrote:
>>>> When a system is under memory pressure (high usage with fragments),
>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>> memory management to enter slow path doing memory compact/migration
>>>> ops in order to complete high order memory allocations. When that
>>>> happens, user processes calling uverb APIs may get stuck for more
>>>> than 120s easily even though there are a lot of free pages in
>>>> smaller chunks available in the system.
>>>>
>>>> Syslog:
>>>> ...
>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>> ...
>>>
>>> NACK on this patch.
>>>
>>> You have been asked repeatedly to use kvmalloc()
>>>
>>> This is not a minor suggestion.
>>>
>>> Take a look at
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d
>>
>> Would you please take a look at how table->icm is being used in the
>> mlx4 driver? It's meta data used for individual pointer variable
>> referencing, not as a data frag or in/out buffer. It has no need for
>> contiguous phy. memory. Thanks.
>
> NACK.
>
> This would cause a degradation when iterating the entries of
> table->icm. For example, in mlx4_table_get_range.

E.g.

int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
			 u32 start, u32 end)
{
	int inc = MLX4_TABLE_CHUNK_SIZE / table->obj_size;
	int err;
	u32 i;

	for (i = start; i <= end; i += inc) {
		err = mlx4_table_get(dev, table, i);
		if (err)
			goto fail;
	}

	return 0;
	...
}

E.g. an mtt obj is 8 bytes, so a 4KB ICM block holds 512 mtt objects. You
would have to allocate another 512 mtt objects in order for the table->icm
pointer to increment by 1 to fetch the next pointer value. So 256K mtt
objects are needed in order to traverse the table->icm pointer across a
page boundary in these call stacks.

Considering mlx4_table_get_range() is only used in the control path, there
is no significant gain from using kvzalloc vs. vzalloc for table->icm.
Anyway, if a user makes sure the mlx4 driver is loaded very early and
doesn't remove and reload it afterwards, we should have enough (and not
waste) contiguous phy mem for the table->icm allocation. I will use
kvzalloc to replace vzalloc and send a V4 patch.

Thanks,
Qing

> Thanks,
> Tariq

>>> And you'll understand some people care about this. Strongly.
>>
>> Thanks.
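[Editor's note] The locality numbers in this exchange check out. With 8-byte pointers, one 4KB page of the table->icm array holds 512 pointers, and each pointer covers a 4KB chunk of 512 mtt entries, so a range walk like mlx4_table_get_range() only crosses a page boundary of the pointer array once every 256K mtt objects:

```python
page_size  = 4096   # one page of the table->icm pointer array
ptr_size   = 8      # 64-bit pointer
obj_size   = 8      # one mtt entry
chunk_size = 4096   # MLX4_TABLE_CHUNK_SIZE after the patch

ptrs_per_page     = page_size // ptr_size     # 512 icm pointers per page
objs_per_chunk    = chunk_size // obj_size    # 512 mtt objects per chunk
objs_per_ptr_page = ptrs_per_page * objs_per_chunk

print(objs_per_ptr_page)  # 262144 -> 256K mtt objects per pointer page
```

This is the basis of Qing's argument that vmalloc'ed (non-contiguous) pointer pages barely affect the iteration: page-boundary crossings are rare relative to the work done per chunk.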
Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks
On 5/17/2018 2:14 PM, Eric Dumazet wrote:
> On 05/17/2018 01:53 PM, Qing Huang wrote:
>> When a system is under memory pressure (high usage with fragments),
>> the original 256KB ICM chunk allocations will likely trigger kernel
>> memory management to enter slow path doing memory compact/migration
>> ops in order to complete high order memory allocations. When that
>> happens, user processes calling uverb APIs may get stuck for more than
>> 120s easily even though there are a lot of free pages in smaller
>> chunks available in the system.
>>
>> Syslog:
>> ...
>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>> oracle_205573_e:205573 blocked for more than 120 seconds.
>> ...
>
> NACK on this patch.
>
> You have been asked repeatedly to use kvmalloc()
>
> This is not a minor suggestion.
>
> Take a look at
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

Would you please take a look at how table->icm is being used in the mlx4
driver? It's meta data used for individual pointer variable referencing,
not as a data frag or in/out buffer. It has no need for contiguous phy.
memory. Thanks.

> And you'll understand some people care about this. Strongly.

Thanks.
[PATCH v3] mlx4_core: allocate ICM memory in page size chunks
When a system is under memory pressure (high usage with fragments), the
original 256KB ICM chunk allocations will likely trigger kernel memory
management to enter the slow path, doing memory compact/migration ops in
order to complete high order memory allocations. When that happens, user
processes calling uverb APIs may get stuck for more than 120s easily even
though there are a lot of free pages in smaller chunks available in the
system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However
in order to support smaller ICM chunk size, we need to fix another issue
in large size kcalloc allocations.

E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM
chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each
mtt entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers, which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need for
contiguous memory pages for a driver meta data structure (no need of DMA
ops).

Signed-off-by: Qing Huang
Acked-by: Daniel Jurgens
Reviewed-by: Zhu Yanjun
---
v3: use PAGE_SIZE instead of PAGE_SHIFT
    add comma to the end of enum variables
    include vmalloc.h header file to avoid build issues on Sparc
v2: adjusted chunk size to reflect different architectures

 drivers/net/ethernet/mellanox/mlx4/icm.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..3705f4e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -33,6 +33,7 @@
 #include
 #include
+#include <linux/vmalloc.h>
 #include
 #include
@@ -43,12 +44,12 @@
 #include "fw.h"
 
 /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
  */
 enum {
-	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
-	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
+	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
+	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,
 };
 
 static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +401,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
 	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
 
-	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+	table->icm = vzalloc(num_icm * sizeof(*table->icm));
 	if (!table->icm)
 		return -ENOMEM;
 	table->virt = virt;
@@ -446,7 +447,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
 			mlx4_free_icm(dev, table->icm[i], use_coherent);
 	}
 
-	kfree(table->icm);
+	vfree(table->icm);
 
 	return -ENOMEM;
 }
@@ -462,5 +463,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
 			mlx4_free_icm(dev, table->icm[i], table->coherent);
 	}
 
-	kfree(table->icm);
+	vfree(table->icm);
 }
--
2.9.3
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 5/15/2018 12:08 PM, Eric Dumazet wrote:
> On 05/15/2018 11:53 AM, Qing Huang wrote:
>>> This is control path so it is less latency-sensitive. Let's not
>>> produce unnecessary degradation here, please call kvzalloc so we
>>> maintain a similar behavior when contiguous memory is available, and
>>> a fallback for resiliency.
>>
>> Not sure what exactly degradation is caused by vzalloc here. I think
>> it's better to keep physically contiguous pages for other requests
>> which really need them. Besides, slow path/mem compacting can be
>> really expensive.
>
> Just use kvzalloc(), and you get the benefit of having contiguous
> memory if available, without the expensive compact phase. This thing
> _automatically_ falls back to vmalloc(), thus your problem will be
> solved.
>
> If you are not sure, trust others.

Thanks for the review. There are many places in the kernel and in
applications where physically contiguous pages are needed. We saw quite a
few issues when there were not enough contiguous phy mem available. My
main concern here is: why use physically contiguous pages when they are
not really needed?
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 5/15/2018 2:19 AM, Tariq Toukan wrote:
> On 14/05/2018 7:41 PM, Qing Huang wrote:
>> On 5/13/2018 2:00 AM, Tariq Toukan wrote:
>>> On 11/05/2018 10:23 PM, Qing Huang wrote:
>>>> When a system is under memory pressure (high usage with fragments),
>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>> memory management to enter slow path doing memory compact/migration
>>>> ops in order to complete high order memory allocations. When that
>>>> happens, user processes calling uverb APIs may get stuck for more
>>>> than 120s easily even though there are a lot of free pages in
>>>> smaller chunks available in the system.
>>>>
>>>> Syslog:
>>>> ...
>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>> ...
>>>>
>>>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>>> However in order to support smaller ICM chunk size, we need to fix
>>>> another issue in large size kcalloc allocations.
>>>>
>>>> E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB
>>>> ICM chunk size, each ICM chunk can only hold 512 mtt entries (8
>>>> bytes for each mtt entry). So we need a 16MB allocation for a
>>>> table->icm pointer array to hold 2M pointers which can easily cause
>>>> kcalloc to fail.
>>>>
>>>> The solution is to use vzalloc to replace kcalloc. There is no need
>>>> for contiguous memory pages for a driver meta data structure (no
>>>> need of DMA ops).
>>>>
>>>> Signed-off-by: Qing Huang
>>>> Acked-by: Daniel Jurgens
>>>> Reviewed-by: Zhu Yanjun
>>>> ---
>>>> v2 -> v1: adjusted chunk size to reflect different architectures.
>>>>
>>>>  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
>>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>> index a822f7a..ccb62b8 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>> @@ -43,12 +43,12 @@
>>>>  #include "fw.h"
>>>>
>>>>  /*
>>>> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
>>>> - * per chunk.
>>>> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
>>>> + * order memory allocations in fragmented/high usage memory situation.
>>>>   */
>>>>  enum {
>>>> -	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
>>>> -	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
>>>> +	MLX4_ICM_ALLOC_SIZE	= 1 << PAGE_SHIFT,
>>>> +	MLX4_TABLE_CHUNK_SIZE	= 1 << PAGE_SHIFT
>>>
>>> Which is actually PAGE_SIZE.
>>
>> Yes, we wanted to avoid high order memory allocations.
>
> Then please use PAGE_SIZE instead.

PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT is
actually more appropriate here.

>>> Also, please add a comma at the end of the last entry.
>>
>> Hmm..., followed the existing code style and checkpatch.pl didn't
>> complain about the comma.
>
> I am in favor of having a comma also after the last element, so that
> when another enum element is added we do not modify this line again,
> which would falsely affect git blame. I know it didn't exist before
> your patch, but once we're here, let's do it.

I'm okay either way. If adding an extra comma is preferred by many people,
someone should update checkpatch.pl to enforce it. :)

>>>>  };
>>>>
>>>>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
>>>> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>>>>  	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>>>>  	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>>>>
>>>> -	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
>>>> +	table->icm = vzalloc(num_icm * sizeof(*table->icm));
>>>
>>> Why not kvzalloc ?
>>
>> I think table->icm pointer array doesn't really need physically
>> contiguous memory. Sometimes high order memory allocation by kmalloc
>> variants may trigger slow path and cause tasks to be blocked.
>
> This is control path so it is less latency-sensitive. Let's not produce
> unnecessary degradation here, please call kvzalloc so we maintain a
> similar behavior when contiguous memory is available, and a fallback
> for resiliency.

Not sure what exactly degradation is caused by vzalloc here. I think it's
better to keep physically contiguous pages for other requests which really
need them. Besides, slow path/mem compacting can be really expensive.

Thanks,
Qing

>>>>  	if (!table->icm)
>>>>  		return -ENOMEM;
>>>>  	table->virt = virt;
>>>> @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>>>>  			mlx4_free_icm(dev, table->icm[i], use_coherent);
>>>>  	}
>>>>
>>>> -	kfree(table->icm);
>>>> +	vfree(table->icm);
>>>>
>>>>  	return -ENOMEM;
>>>> }
>>>> @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
>>>>  			mlx4_free_icm(dev, table->icm[i], table->coherent);
>>>>  	}
>>>>
>>>> -	kfree(table->icm);
>>>> +	vfree(table->icm);
>>>> }
>>>
>>> Thanks for your patch.
>>> I need to verify there is no dramatic performance degradation here.
>>> You can prepare and send a v3 in the meanwhile.
>>>
>>> Thanks,
>>> Tariq
Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
On 5/13/2018 2:00 AM, Tariq Toukan wrote:
> On 11/05/2018 10:23 PM, Qing Huang wrote:
>> When a system is under memory pressure (high usage with fragments),
>> the original 256KB ICM chunk allocations will likely trigger kernel
>> memory management to enter slow path doing memory compact/migration
>> ops in order to complete high order memory allocations. When that
>> happens, user processes calling uverb APIs may get stuck for more than
>> 120s easily even though there are a lot of free pages in smaller
>> chunks available in the system.
>>
>> Syslog:
>> ...
>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>> oracle_205573_e:205573 blocked for more than 120 seconds.
>> ...
>>
>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>> However in order to support smaller ICM chunk size, we need to fix
>> another issue in large size kcalloc allocations.
>>
>> E.g. Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM
>> chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for
>> each mtt entry). So we need a 16MB allocation for a table->icm pointer
>> array to hold 2M pointers which can easily cause kcalloc to fail.
>>
>> The solution is to use vzalloc to replace kcalloc. There is no need
>> for contiguous memory pages for a driver meta data structure (no need
>> of DMA ops).
>>
>> Signed-off-by: Qing Huang
>> Acked-by: Daniel Jurgens
>> Reviewed-by: Zhu Yanjun
>> ---
>> v2 -> v1: adjusted chunk size to reflect different architectures.
>>
>>  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
>> index a822f7a..ccb62b8 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
>> @@ -43,12 +43,12 @@
>>  #include "fw.h"
>>
>>  /*
>> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
>> - * per chunk.
>> + * We allocate in page size (default 4KB on many archs) chunks to avoid high
>> + * order memory allocations in fragmented/high usage memory situation.
>>   */
>>  enum {
>> -	MLX4_ICM_ALLOC_SIZE	= 1 << 18,
>> -	MLX4_TABLE_CHUNK_SIZE	= 1 << 18
>> +	MLX4_ICM_ALLOC_SIZE	= 1 << PAGE_SHIFT,
>> +	MLX4_TABLE_CHUNK_SIZE	= 1 << PAGE_SHIFT
>
> Which is actually PAGE_SIZE.

Yes, we wanted to avoid high order memory allocations.

> Also, please add a comma at the end of the last entry.

Hmm..., followed the existing code style and checkpatch.pl didn't complain
about the comma.

>>  };
>>
>>  static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
>> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>>  	obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>>  	num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>>
>> -	table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
>> +	table->icm = vzalloc(num_icm * sizeof(*table->icm));
>
> Why not kvzalloc ?

I think table->icm pointer array doesn't really need physically contiguous
memory. Sometimes high order memory allocation by kmalloc variants may
trigger slow path and cause tasks to be blocked.

Thanks,
Qing

>>  	if (!table->icm)
>>  		return -ENOMEM;
>>  	table->virt = virt;
>> @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
>>  			mlx4_free_icm(dev, table->icm[i], use_coherent);
>>  	}
>>
>> -	kfree(table->icm);
>> +	vfree(table->icm);
>>
>>  	return -ENOMEM;
>> }
>> @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
>>  			mlx4_free_icm(dev, table->icm[i], table->coherent);
>>  	}
>>
>> -	kfree(table->icm);
>> +	vfree(table->icm);
>> }
>
> Thanks for your patch.
> I need to verify there is no dramatic performance degradation here.
> You can prepare and send a v3 in the meanwhile.
>
> Thanks,
> Tariq
[PATCH V2] mlx4_core: allocate ICM memory in page size chunks
When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter the slow path, doing memory compaction/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may easily get stuck for more than 120s even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With 4KB ICM chunk size on x86_64 arch, the above issue is fixed. However, in order to support smaller ICM chunk sizes, we need to fix another issue in large kcalloc allocations. E.g. setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers, which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver metadata structure (no need of DMA ops). Signed-off-by: Qing Huang Acked-by: Daniel Jurgens Reviewed-by: Zhu Yanjun --- v2 -> v1: adjusted chunk size to reflect different architectures. drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..ccb62b8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. + * We allocate in page size (default 4KB on many archs) chunks to avoid high + * order memory allocations in fragmented/high usage memory situation. 
*/ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << PAGE_SHIFT, + MLX4_TABLE_CHUNK_SIZE = 1 << PAGE_SHIFT }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); } -- 2.9.3
Re: [PATCH] mlx4_core: allocate 4KB ICM chunks
On 5/11/2018 3:27 AM, Håkon Bugge wrote: On 11 May 2018, at 01:31, Qing Huang wrote: When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter the slow path, doing memory compaction/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may easily get stuck for more than 120s even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With a 4KB ICM chunk size, the above issue is fixed. However, in order to support the 4KB ICM chunk size, we need to fix another issue in large kcalloc allocations. E.g. setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers, which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver metadata structure (no need of DMA ops). Signed-off-by: Qing Huang Acked-by: Daniel Jurgens --- drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..2b17a4b 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. + * We allocate in 4KB page size chunks to avoid high order memory + * allocations in fragmented/high usage memory situation. 
*/ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << 12, + MLX4_TABLE_CHUNK_SIZE = 1 << 12 Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the hw page size is 8KiB. Good point on supporting a wider range of architectures. I got tunnel vision when fixing this on our x64 lab machines. Will send a v2 patch. Thanks, Qing Thxs, Håkon }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); } -- 2.9.3
Re: [PATCH] mlx4_core: allocate 4KB ICM chunks
Thank you for reviewing it! On 5/10/2018 6:23 PM, Yanjun Zhu wrote: On 2018/5/11 9:15, Qing Huang wrote: On 5/10/2018 5:13 PM, Yanjun Zhu wrote: On 2018/5/11 7:31, Qing Huang wrote: When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter the slow path, doing memory compaction/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may easily get stuck for more than 120s even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With a 4KB ICM chunk size, the above issue is fixed. However, in order to support the 4KB ICM chunk size, we need to fix another issue in large kcalloc allocations. E.g. setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers, which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver metadata structure (no need of DMA ops). Hi, replacing contiguous memory pages with virtual memory, is there any performance loss? Zhu Yanjun Not really. "table->icm" will be accessed as individual pointer variables randomly. Kcalloc also returns a virtual address, except its mapped pages are guaranteed to be contiguous, which provides little advantage over vzalloc for individual pointer variable access. Qing Sure. Thanks. If "table->icm" will be accessed as individual pointer variables randomly, the performance loss caused by discontinuous memory will be very trivial. Reviewed-by: Zhu Yanjun 
Signed-off-by: Qing Huang Acked-by: Daniel Jurgens --- drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..2b17a4b 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. + * We allocate in 4KB page size chunks to avoid high order memory + * allocations in fragmented/high usage memory situation. */ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << 12, + MLX4_TABLE_CHUNK_SIZE = 1 << 12 }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); }
[PATCH] mlx4_core: allocate 4KB ICM chunks
When a system is under memory pressure (high usage with fragments), the original 256KB ICM chunk allocations will likely trigger kernel memory management to enter the slow path, doing memory compaction/migration ops in order to complete high order memory allocations. When that happens, user processes calling uverb APIs may easily get stuck for more than 120s even though there are a lot of free pages in smaller chunks available in the system. Syslog: ... Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task oracle_205573_e:205573 blocked for more than 120 seconds. ... With a 4KB ICM chunk size, the above issue is fixed. However, in order to support the 4KB ICM chunk size, we need to fix another issue in large kcalloc allocations. E.g. setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt entry). So we need a 16MB allocation for a table->icm pointer array to hold 2M pointers, which can easily cause kcalloc to fail. The solution is to use vzalloc to replace kcalloc. There is no need for contiguous memory pages for a driver metadata structure (no need of DMA ops). Signed-off-by: Qing Huang Acked-by: Daniel Jurgens --- drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c index a822f7a..2b17a4b 100644 --- a/drivers/net/ethernet/mellanox/mlx4/icm.c +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c @@ -43,12 +43,12 @@ #include "fw.h" /* - * We allocate in as big chunks as we can, up to a maximum of 256 KB - * per chunk. + * We allocate in 4KB page size chunks to avoid high order memory + * allocations in fragmented/high usage memory situation. 
*/ enum { - MLX4_ICM_ALLOC_SIZE = 1 << 18, - MLX4_TABLE_CHUNK_SIZE = 1 << 18 + MLX4_ICM_ALLOC_SIZE = 1 << 12, + MLX4_TABLE_CHUNK_SIZE = 1 << 12 }; static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk) @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size; num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk; - table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL); + table->icm = vzalloc(num_icm * sizeof(*table->icm)); if (!table->icm) return -ENOMEM; table->virt = virt; @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table, mlx4_free_icm(dev, table->icm[i], use_coherent); } - kfree(table->icm); + vfree(table->icm); return -ENOMEM; } @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table) mlx4_free_icm(dev, table->icm[i], table->coherent); } - kfree(table->icm); + vfree(table->icm); } -- 2.9.3
Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
Thanks! Looks like Eran's patch will be available much sooner than the v4.18 release time frame (since v4.16 was just released). We will wait. :-) On 04/26/2018 04:43 PM, Saeed Mahameed wrote: Before you address my comments, it looks like Eran's work is converging and we will finalize the internal review next week, in which case submission will take 2-3 weeks from today. So I suggest to wait. Please check out: https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=testing/mlx5e-stats&id=15ffb7f87432d073e8ac0e6b895188d40fdda4d4
Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
On 04/26/2018 02:50 PM, Saeed Mahameed wrote: On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang wrote: The current stats collecting scheme in the mlx5 driver is to periodically fetch aggregated stats from all the active mlx5 software channels associated with the device. However, when a mlx5 interface is brought down (ifdown), all the channels will be deactivated and closed. A new set of channels will be created when the next ifup command or a similar command is called. Unfortunately, the new channels will have all stats reset to 0, so you lose the accumulated stats information. This behavior is different from other netdev drivers, including the mlx4 driver. In order to fix it, we now save prior mlx5 software stats into netdev stats fields, so all the accumulated stats will survive multiple runs of ifdown/ifup commands and be shown correctly. Orabug: 27548610 Signed-off-by: Qing Huang --- Hi Qing, I am adding Eran since he is currently working on a similar patch. He is also taking care of all cores/rings stats to make them persistent, so you won't have a discrepancy between ethtool and ifconfig stats. I am ok with this patch, but this means Eran has to work his way around it. So we have two options: 1. Temporarily accept this patch, and change it later with Eran's work. 2. Wait for Eran's work. I am ok with either one of them, please let me know. Thanks! Hi Saeed, Any idea on a rough ETA for Eran's stats work to land upstream? If it will be available soon, I think we can wait a bit. If it will take a while to redesign the whole stats scheme (for both ethtool and ifconfig), maybe we can go with this incremental fix first? Thanks! 
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++ 1 file changed, 25 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index f1fe490..5d50e69 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev) netdev_set_tc_queue(netdev, tc, nch, 0); } +static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv) +{ + struct net_device *netdev = priv->netdev; + + netdev->stats.rx_packets += priv->stats.sw.rx_packets; + netdev->stats.rx_bytes += priv->stats.sw.rx_bytes; + netdev->stats.tx_packets += priv->stats.sw.tx_packets; + netdev->stats.tx_bytes += priv->stats.sw.tx_bytes; + netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped; + + priv->stats.sw.rx_packets = 0; + priv->stats.sw.rx_bytes = 0; + priv->stats.sw.tx_packets = 0; + priv->stats.sw.tx_bytes = 0; + priv->stats.sw.tx_queue_dropped = 0; +} + This means that we are now explicitly clearing channel stats on ifconfig down or switch_channels. And now after ifconfig down, ethtool will always show 0; before this patch it didn't. Anyway, the sw stats update function will always override them with the new channel stats next time we load new channels, so it is not that big of a deal. 
static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv) { struct mlx5e_channel *c; @@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv, netif_set_real_num_tx_queues(netdev, new_num_txqs); mlx5e_deactivate_priv_channels(priv); + mlx5e_netdev_save_stats(priv); mlx5e_close_channels(&priv->channels); priv->channels = *new_chs; @@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev) netif_carrier_off(priv->netdev); mlx5e_deactivate_priv_channels(priv); + mlx5e_netdev_save_stats(priv); mlx5e_close_channels(&priv->channels); return 0; @@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type, stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok); stats->tx_bytes = PPORT_802_3_GET(pstats, a_octets_transmitted_ok); } else { - stats->rx_packets = sstats->rx_packets; - stats->rx_bytes = sstats->rx_bytes; - stats->tx_packets = sstats->tx_packets; - stats->tx_bytes = sstats->tx_bytes; - stats->tx_dropped = sstats->tx_queue_dropped; + stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets; + stats->rx_bytes = sstats->rx_bytes + dev->stats.rx_bytes; + stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets; + stats->tx_bytes = sstats->tx_bytes + dev->stats.tx_bytes; + stats->tx_dropped = sstats->tx_queue_dropped + + dev->stats.tx_dropped; } stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer; -- 1.8.3.1
[PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
The current stats collecting scheme in the mlx5 driver is to periodically fetch aggregated stats from all the active mlx5 software channels associated with the device. However, when a mlx5 interface is brought down (ifdown), all the channels will be deactivated and closed. A new set of channels will be created when the next ifup command or a similar command is called. Unfortunately, the new channels will have all stats reset to 0, so you lose the accumulated stats information. This behavior is different from other netdev drivers, including the mlx4 driver. In order to fix it, we now save prior mlx5 software stats into netdev stats fields, so all the accumulated stats will survive multiple runs of ifdown/ifup commands and be shown correctly. Orabug: 27548610 Signed-off-by: Qing Huang --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++ 1 file changed, 25 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index f1fe490..5d50e69 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev) netdev_set_tc_queue(netdev, tc, nch, 0); } +static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv) +{ + struct net_device *netdev = priv->netdev; + + netdev->stats.rx_packets += priv->stats.sw.rx_packets; + netdev->stats.rx_bytes += priv->stats.sw.rx_bytes; + netdev->stats.tx_packets += priv->stats.sw.tx_packets; + netdev->stats.tx_bytes += priv->stats.sw.tx_bytes; + netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped; + + priv->stats.sw.rx_packets = 0; + priv->stats.sw.rx_bytes = 0; + priv->stats.sw.tx_packets = 0; + priv->stats.sw.tx_bytes = 0; + priv->stats.sw.tx_queue_dropped = 0; +} + static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv) { struct mlx5e_channel *c; @@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct 
mlx5e_priv *priv, netif_set_real_num_tx_queues(netdev, new_num_txqs); mlx5e_deactivate_priv_channels(priv); + mlx5e_netdev_save_stats(priv); mlx5e_close_channels(&priv->channels); priv->channels = *new_chs; @@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev) netif_carrier_off(priv->netdev); mlx5e_deactivate_priv_channels(priv); + mlx5e_netdev_save_stats(priv); mlx5e_close_channels(&priv->channels); return 0; @@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type, stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok); stats->tx_bytes = PPORT_802_3_GET(pstats, a_octets_transmitted_ok); } else { - stats->rx_packets = sstats->rx_packets; - stats->rx_bytes = sstats->rx_bytes; - stats->tx_packets = sstats->tx_packets; - stats->tx_bytes = sstats->tx_bytes; - stats->tx_dropped = sstats->tx_queue_dropped; + stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets; + stats->rx_bytes = sstats->rx_bytes + dev->stats.rx_bytes; + stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets; + stats->tx_bytes = sstats->tx_bytes + dev->stats.tx_bytes; + stats->tx_dropped = sstats->tx_queue_dropped + + dev->stats.tx_dropped; } stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer; -- 1.8.3.1
Re: Setting large MTU size on slave interfaces may stall the whole system
Hi Or, On 12/13/2017 06:28 AM, Or Gerlitz wrote: On Tue, Dec 12, 2017 at 5:21 AM, Qing Huang wrote: Hi, We found an issue with the bonding driver when testing Mellanox devices. The following test commands will stall the whole system sometimes, with the serial console flooded with log messages from the bond_miimon_inspect() function. Setting the MTU size to 1500 seems okay, but very rarely it may hit the same problem too. ip address flush dev ens3f0 ip link set dev ens3f0 down ip address flush dev ens3f1 ip link set dev ens3f1 down [root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1 updelay=500 downdelay=500 [root@ca-hcl629 etc]# ifconfig bond0 up [root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1 [root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up Serial console output: ** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms [..] It seems that when setting a large MTU size on an RoCE interface, the RTNL mutex may be held too long by the slave interface, causing bond_mii_monitor() to be called repeatedly at an interval of 1 tick (1000 Hz kernel configuration) and the kernel to become unresponsive. Did you try/manage to reproduce that also with other NIC drivers? This seems to be an issue with the bonding driver. Also, running older kernels on the same hardware without commit (de77ecd4ef02) seems to work fine. We can try to reproduce the issue with other types of NIC hardware. Or.
Setting large MTU size on slave interfaces may stall the whole system
(resend this email in text format) Hi, We found an issue with the bonding driver when testing Mellanox devices. The following test commands will stall the whole system sometimes, with the serial console flooded with log messages from the bond_miimon_inspect() function. Setting the MTU size to 1500 seems okay, but very rarely it may hit the same problem too. ip address flush dev ens3f0 ip link set dev ens3f0 down ip address flush dev ens3f1 ip link set dev ens3f1 down [root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1 updelay=500 downdelay=500 [root@ca-hcl629 etc]# ifconfig bond0 up [root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1 [root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up Serial console output: ** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms ** 5 printk messages dropped ** [ 3717.755737] bond0: link status down for interface ens3f0, disabling it in 500 ms ** 5 printk messages dropped ** [ 3717.767758] bond0: link status down for interface ens3f0, disabling it in 500 ms ** 4 printk messages dropped ** [ 3717.37] bond0: link status down for interface ens3f0, disabling it in 500 ms or ** 4 printk messages dropped ** [274743.297863] bond0: link status down again after 500 ms for interface enp48s0f1 ** 4 printk messages dropped ** [274743.307866] bond0: link status down again after 500 ms for interface enp48s0f1 ** 4 printk messages dropped ** [274743.317857] bond0: link status down again after 500 ms for interface enp48s0f1 ** 4 printk messages dropped ** [274743.327823] bond0: link status down again after 500 ms for interface enp48s0f1 ** 4 printk messages dropped ** [274743.337817] bond0: link status down again after 500 ms for interface enp48s0f1 The root cause is the combined effect of commit 1f2cd845d3827412e82bf26dde0abca332ede402 (Revert "Merge branch 'bonding_monitor_locking'") and commit de77ecd4ef02ca783f7762e04e92b3d0964be66b ("bonding: improve link-status update 
in mii-monitoring"). E.g. reverting the second commit, we don't see the problem. It seems that when setting a large mtu size on an RoCE interface, the RTNL mutex may be held too long by the slave interface, causing bond_mii_monitor() to be called repeatedly at an interval of 1 tick (1K HZ kernel configuration) and kernel to become unresponsive. We found two possible solutions: #1, don't re-arm the mii monitor thread too quick if we cannot get RTNL lock: index b2db581..8fd587a 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2266,7 +2266,6 @@ static void bond_mii_monitor(struct work_struct *work) /* Race avoidance with bond_close cancel of workqueue */ if (!rtnl_trylock()) { - delay = 1; should_notify_peers = false; goto re_arm; } #2, we use printk_ratelimit() to avoid flooding log messages generated by bond_miimon_inspect(). index b2db581..0183b7f 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2054,7 +2054,7 @@ static int bond_miimon_inspect(struct bonding *bond) bond_propose_link_state(slave, BOND_LINK_FAIL); commit++; slave->delay = bond->params.downdelay; - if (slave->delay) { + if (slave->delay && printk_ratelimit()) { netdev_info(bond->dev, "link status down for %sinterface %s, disabling it in %d ms\n", (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) ? @@ -2105,7 +2105,8 @@ static int bond_miimon_inspect(struct bonding *bond) case BOND_LINK_BACK: if (!link_state) { bond_propose_link_state(slave, BOND_LINK_DOWN); - netdev_info(bond->dev, "link status down again after %d ms for interface %s\n", + if(printk_ratelimit()) + netdev_info(bond->dev, "link status down again after %d ms for interface %s\n", (bond->params.updelay - slave->delay) * bond->params.miimon, slave->dev->name); Regarding the flooding messages, the netdev_info output is misleading anyway when bond_mii_monitor() is called at 1 tick interval due to lock contention. Solution #1 looks simpler and cleaner to me. 
Any side effect of doing that? Thanks, Qing