Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"
On 9/11/2024 7:22 AM, Michael S. Tsirkin wrote: Thanks a lot! Could you retest Xuan Zhuo original patch Which one? I thought Darren already did so? -Siwei just to make sure it does not fix the issue? On Wed, Sep 11, 2024 at 03:18:55PM +0100, Darren Kenny wrote: For the record, I got a chance to test these changes and confirmed that they resolved the issue for me when applied on 6.11-rc7. Tested-by: Darren Kenny Thanks, Darren. PS - I'll try get to looking at the other potential fix when I have time. On Tuesday, 2024-09-10 at 08:12:06 -04, Michael S. Tsirkin wrote: On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote: Regression: http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com I still think that the patch can fix the problem, I hope Darren can re-test it or give me more info. http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com If that can not work or Darren can not reply in time, Michael you can try this patch set. Just making sure netdev maintainers see this, this patch is for net. Thanks. Xuan Zhuo (3): Revert "virtio_net: rx remove premapped failover code" Revert "virtio_net: big mode skip the unmap check" virtio_net: disable premapped mode by default drivers/net/virtio_net.c | 95 +++- 1 file changed, 46 insertions(+), 49 deletions(-) -- 2.32.0.3.g01195cf9f
Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc
Just in case Xuan missed the last email while his email server kept rejecting incoming emails in the last week.: the patch doesn't seem fix the regression. Xuan, given this is not very hard to reproduce and we have clearly stated how to, could you try to get the patch verified in house before posting to upstream? Or you were unable to reproduce locally? Thanks, -Siwei On 8/21/2024 9:47 AM, Darren Kenny wrote: Hi Michael, On Tuesday, 2024-08-20 at 12:50:39 -04, Michael S. Tsirkin wrote: On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote: leads to regression on VM with the sysctl value of: - net.core.high_order_alloc_disable=1 which could see reliable crashes or scp failure (scp a file 100M in size to VM): The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning of a new frag. When the frag size is larger than PAGE_SIZE, everything is fine. However, if the frag is only one page and the total size of the buffer and virtnet_rq_dma is larger than one page, an overflow may occur. In this case, if an overflow is possible, I adjust the buffer size. If net.core.high_order_alloc_disable=1, the maximum buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only the first buffer of the frag is affected. Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api") Reported-by: "Si-Wei Liu" Closes: http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com Signed-off-by: Xuan Zhuo Darren, could you pls test and confirm? Unfortunately with this change I seem to still get a panic as soon as I start a download using wget: [ 144.055630] Kernel panic - not syncing: corrupted stack end detected inside scheduler [ 144.056249] CPU: 8 PID: 37894 Comm: sleep Kdump: loaded Not tainted 6.10.0-1.el8uek.x86_64 #2 [ 144.056850] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-4.module+el8.9.0+90173+a3f3e83a 04/01/2014 [ 144.057585] Call Trace: [ 144.057791] [ 144.057973] panic+0x347/0x370 [ 144.058223] schedule_debug.isra.0+0xfb/0x100 [ 144.058565] __schedule+0x58/0x6a0 [ 144.058838] ? refill_stock+0x26/0x50 [ 144.059120] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.059473] do_task_dead+0x42/0x50 [ 144.059752] do_exit+0x31e/0x4b0 [ 144.060011] ? __audit_syscall_entry+0xee/0x150 [ 144.060352] do_group_exit+0x30/0x80 [ 144.060633] __x64_sys_exit_group+0x18/0x20 [ 144.060946] do_syscall_64+0x8c/0x1c0 [ 144.061228] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.061570] ? __audit_filter_op+0xbe/0x140 [ 144.061873] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.062204] ? audit_reset_context+0x232/0x310 [ 144.062514] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.062851] ? syscall_exit_work+0x103/0x130 [ 144.063148] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.063473] ? syscall_exit_to_user_mode+0x77/0x220 [ 144.063813] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.064142] ? do_syscall_64+0xb9/0x1c0 [ 144.064411] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.064747] ? do_syscall_64+0xb9/0x1c0 [ 144.065018] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.065345] ? do_read_fault+0x109/0x1b0 [ 144.065628] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.065961] ? do_fault+0x1aa/0x2f0 [ 144.066212] ? handle_pte_fault+0x102/0x1a0 [ 144.066503] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.066836] ? __handle_mm_fault+0x5ed/0x710 [ 144.067137] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.067464] ? __count_memcg_events+0x72/0x110 [ 144.067779] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.068106] ? count_memcg_events.constprop.0+0x26/0x50 [ 144.068457] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.068788] ? handle_mm_fault+0xae/0x320 [ 144.069068] ? srso_alias_return_thunk+0x5/0xfbef5 [ 144.069395] ? do_user_addr_fault+0x34a/0x6b0 [ 144.069708] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 144.070049] RIP: 0033:0x7fc5524f9c66 [ 144.070307] Code: Unable to access opcode bytes at 0x7fc5524f9c3c. [ 144.070720] RSP: 002b:7ffee052beb8 EFLAGS: 0246 ORIG_RAX: 00e7 [ 144.071214] RAX: ffda RBX: 7fc5527bb860 RCX: 7fc5524f9c66 [ 144.071684] RDX: RSI: 003c RDI: [ 144.072146] RBP: R08: 00e7 R09: ff78 [ 144.072608] R10: 7ffee052bdef R11: 0246 R12: 7fc5527bb860 [ 144.073076] R13: 0002 R14: 7fc5527c4528 R15: [ 144.073543] [ 144.074780] Kernel Offset: 0x37c0 from 0x8100 (relocation range: 0x8000-0xbfff) Thanks, Darren. --- drivers/net/virtio_net.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index c6af18948092..e5286a6da863 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct rec
Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc
On 8/20/2024 1:09 PM, Michael S. Tsirkin wrote: On Tue, Aug 20, 2024 at 12:44:46PM -0700, Si-Wei Liu wrote: On 8/20/2024 12:19 AM, Xuan Zhuo wrote: leads to regression on VM with the sysctl value of: - net.core.high_order_alloc_disable=1 which could see reliable crashes or scp failure (scp a file 100M in size to VM): The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning of a new frag. When the frag size is larger than PAGE_SIZE, everything is fine. However, if the frag is only one page and the total size of the buffer and virtnet_rq_dma is larger than one page, an overflow may occur. In this case, if an overflow is possible, I adjust the buffer size. If net.core.high_order_alloc_disable=1, the maximum buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only the first buffer of the frag is affected. Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api") Reported-by: "Si-Wei Liu" Closes: http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index c6af18948092..e5286a6da863 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp) void *buf, *head; dma_addr_t addr; - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp))) - return NULL; - head = page_address(alloc_frag->page); dma = head; @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq, len = SKB_DATA_ALIGN(len) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp))) + return -ENOMEM; + Do you want to document the assumption that small packet case won't end up crossing the page frag boundary unlike the mergeable case? Add a comment block to explain or a WARN_ON() check against potential overflow would work with me. buf = virtnet_rq_alloc(rq, len, gfp); if (unlikely(!buf)) return -ENOMEM; @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, */ len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room); + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp))) + return -ENOMEM; + + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > alloc_frag->size) + len -= sizeof(struct virtnet_rq_dma); + This could address my previous concern for possibly regressing every buffer size for the mergeable case, thanks. Though I still don't get why carving up a small chunk from page_frag for storing the virtnet_rq_dma metadata, this would cause perf regression on certain MTU size 4Kbyte MTU exactly? Close to that, excluding all headers upfront (depending on virtio features and header layout). The size of struct virtnet_rq_dma is now 16 bytes, this could lead to performance impact on roughly: 16 / 4096 = 0.4 % across all MTU sizes, more obviously to be seen with order-0 page allocations. -Siwei that happens to end up with one more base page (and an extra descriptor as well) to be allocated compared to the previous code without the extra virtnet_rq_dma content. How hard would it be to allocate a dedicated struct to store the related information without affecting the (size of) datapath pages? FWIW, out of the code review perspective, I've looked up the past conversations but didn't see comprehensive benchmark was done before removing the old code and making premap the sole default mode. Granted this would reduce the footprint of additional code and the associated maintaining cost immediately, but I would assume at least there should have been thorough performance runs upfront to guarantee no regression is seen with every possible use case, or the negative effect is comparatively negligible even though there's slight regression in some limited case. If that kind of perf measurement hadn't been done before getting accepted/merged, I think at least it should allow both modes to coexist for a while such that every user could gauge the performance effect. Thanks, -Siwei buf = virtnet_rq_alloc(rq, len + room, gfp); if (unlikely(!buf)) return -ENOMEM;
Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc
On 8/20/2024 12:19 AM, Xuan Zhuo wrote: leads to regression on VM with the sysctl value of: - net.core.high_order_alloc_disable=1 which could see reliable crashes or scp failure (scp a file 100M in size to VM): The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning of a new frag. When the frag size is larger than PAGE_SIZE, everything is fine. However, if the frag is only one page and the total size of the buffer and virtnet_rq_dma is larger than one page, an overflow may occur. In this case, if an overflow is possible, I adjust the buffer size. If net.core.high_order_alloc_disable=1, the maximum buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only the first buffer of the frag is affected. Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api") Reported-by: "Si-Wei Liu" Closes: http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index c6af18948092..e5286a6da863 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp) void *buf, *head; dma_addr_t addr; - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp))) - return NULL; - head = page_address(alloc_frag->page); dma = head; @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq, len = SKB_DATA_ALIGN(len) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp))) + return -ENOMEM; + Do you want to document the assumption that small packet case won't end up crossing the page frag boundary unlike the mergeable case? Add a comment block to explain or a WARN_ON() check against potential overflow would work with me. buf = virtnet_rq_alloc(rq, len, gfp); if (unlikely(!buf)) return -ENOMEM; @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, */ len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room); + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp))) + return -ENOMEM; + + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > alloc_frag->size) + len -= sizeof(struct virtnet_rq_dma); + This could address my previous concern for possibly regressing every buffer size for the mergeable case, thanks. Though I still don't get why carving up a small chunk from page_frag for storing the virtnet_rq_dma metadata, this would cause perf regression on certain MTU size that happens to end up with one more base page (and an extra descriptor as well) to be allocated compared to the previous code without the extra virtnet_rq_dma content. How hard would it be to allocate a dedicated struct to store the related information without affecting the (size of) datapath pages? FWIW, out of the code review perspective, I've looked up the past conversations but didn't see comprehensive benchmark was done before removing the old code and making premap the sole default mode. Granted this would reduce the footprint of additional code and the associated maintaining cost immediately, but I would assume at least there should have been thorough performance runs upfront to guarantee no regression is seen with every possible use case, or the negative effect is comparatively negligible even though there's slight regression in some limited case. If that kind of perf measurement hadn't been done before getting accepted/merged, I think at least it should allow both modes to coexist for a while such that every user could gauge the performance effect. Thanks, -Siwei buf = virtnet_rq_alloc(rq, len + room, gfp); if (unlikely(!buf)) return -ENOMEM;
Re: [PATCH net-next v5 1/4] virtio_ring: enable premapped mode whatever use_dma_api
Hi, May I know if this is really an intended fix to post officially, or just a workaround/probe to make the offset in page_frag happy when net_high_order_alloc_disable is true? In case it's the former, even though this could fix the issue, I would assume clamping to a smaller page_frag than a regular page size for every buffer may have certain performance regression for the merge-able buffer case? Can you justify the performance impact with some benchmark runs with larger MTU and merge-able rx buffers to prove the regression is negligible? You would need to compare against where you don't have the inadvertent virtnet_rq_dma cost on any page i.e. getting all 4 patches of this series reverted. Both tests with net_high_order_alloc_disable set to on and off are needed. Thanks, -Siwei On 8/17/2024 6:20 AM, Xuan Zhuo wrote: Hi, guys, I have a fix patch for this. Could anybody test it? Thanks. diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index af474cc191d0..426d68c2d01d 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2492,13 +2492,15 @@ static unsigned int get_mergeable_buf_len(struct receive_queue *rq, { struct virtnet_info *vi = rq->vq->vdev->priv; const size_t hdr_len = vi->hdr_len; - unsigned int len; + unsigned int len, max_len; + + max_len = PAGE_SIZE - ALIGN(sizeof(struct virtnet_rq_dma), L1_CACHE_BYTES); if (room) - return PAGE_SIZE - room; + return max_len - room; len = hdr_len + clamp_t(unsigned int, ewma_pkt_len_read(avg_pkt_len), - rq->min_buf_len, PAGE_SIZE - hdr_len); + rq->min_buf_len, max_len - hdr_len); return ALIGN(len, L1_CACHE_BYTES); }
Re: [PATCH net-next v5 1/4] virtio_ring: enable premapped mode whatever use_dma_api
Hi Michael, I'll look for someone else from Oracle to help you on this, as the relevant team already did verify internally that reverting all 4 patches from this series could help address the regression. Just reverting one single commit won't help. 9719f039d328 virtio_net: remove the misleading comment defd28aa5acb virtio_net: rx remove premapped failover code a377ae542d8d virtio_net: big mode skip the unmap check f9dac92ba908 virtio_ring: enable premapped mode whatever use_dma_api In case I fail to get someone to help, could you work with Darren (cc'ed) directly? He could reach out to the corresponding team in Oracle to help with testing. Thanks, -Siwei On 8/13/2024 12:46 PM, Michael S. Tsirkin wrote: Want to post a patchset to revert?
Re: [PATCH net-next v5 1/4] virtio_ring: enable premapped mode whatever use_dma_api
Turning out this below commit to unconditionally enable premapped virtio-net: commit f9dac92ba9081062a6477ee015bd3b8c5914efc4 Author: Xuan Zhuo Date: Sat May 11 11:14:01 2024 +0800 leads to regression on VM with no ACCESS_PLATFORM, and with the sysctl value of: - net.core.high_order_alloc_disable=1 which could see reliable crashes or scp failure (scp a file 100M in size to VM): [ 332.079333] __vm_enough_memory: pid: 18440, comm: sshd, bytes: 5285790347661783040 not enough memory for the allocation [ 332.079651] [ cut here ] [ 332.079655] kernel BUG at mm/mmap.c:3514! [ 332.080095] invalid opcode: [#1] PREEMPT SMP NOPTI [ 332.080826] CPU: 18 PID: 18440 Comm: sshd Kdump: loaded Not tainted 6.10.0-2.x86_64 #2 [ 332.081514] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-4.module+el8.9.0+90173+a3f3e83a 04/01/2014 [ 332.082451] RIP: 0010:exit_mmap+0x3a1/0x3b0 [ 332.082871] Code: be 01 00 00 00 48 89 df e8 0c 94 fe ff eb d7 be 01 00 00 00 48 89 df e8 5d 98 fe ff eb be 31 f6 48 89 df e8 31 99 fe ff eb a8 <0f> 0b e8 68 bc ae 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 [ 332.084230] RSP: 0018:9988b1c8f948 EFLAGS: 00010293 [ 332.084635] RAX: 0406 RBX: 8d47583e7380 RCX: [ 332.085171] RDX: RSI: RDI: [ 332.085699] RBP: 008f R08: R09: [ 332.086233] R10: R11: R12: 8d47583e7430 [ 332.086761] R13: 8d47583e73c0 R14: 0406 R15: 000495ae650dda58 [ 332.087300] FS: 7ff443899980() GS:8df1c570() knlGS: [ 332.087888] CS: 0010 DS: ES: CR0: 80050033 [ 332.088334] CR2: 55a42d30b730 CR3: 0102e956a004 CR4: 00770ef0 [ 332.088867] PKRU: 5554 [ 332.089114] Call Trace: [ 332.089349] [ 332.089556] ? die+0x36/0x90 [ 332.089818] ? do_trap+0xed/0x110 [ 332.090110] ? exit_mmap+0x3a1/0x3b0 [ 332.090411] ? do_error_trap+0x6a/0xa0 [ 332.090722] ? exit_mmap+0x3a1/0x3b0 [ 332.091029] ? exc_invalid_op+0x50/0x80 [ 332.091348] ? exit_mmap+0x3a1/0x3b0 [ 332.091648] ? asm_exc_invalid_op+0x1a/0x20 [ 332.091998] ? exit_mmap+0x3a1/0x3b0 [ 332.092299] ? exit_mmap+0x1d6/0x3b0 [ 332.092604] __mmput+0x3e/0x130 [ 332.092882] dup_mm.constprop.0+0x10c/0x110 [ 332.093226] copy_process+0xbd0/0x1570 [ 332.093539] kernel_clone+0xbf/0x430 [ 332.093838] ? syscall_exit_work+0x103/0x130 [ 332.094197] __do_sys_clone+0x66/0xa0 [ 332.094506] do_syscall_64+0x8c/0x1d0 [ 332.094814] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.095198] ? audit_reset_context+0x232/0x310 [ 332.095558] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.095936] ? syscall_exit_work+0x103/0x130 [ 332.096288] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.096668] ? syscall_exit_to_user_mode+0x7d/0x220 [ 332.097059] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.097436] ? do_syscall_64+0xba/0x1d0 [ 332.097752] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.098137] ? syscall_exit_to_user_mode+0x7d/0x220 [ 332.098525] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.098903] ? do_syscall_64+0xba/0x1d0 [ 332.099227] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.099606] ? __audit_filter_op+0xbe/0x140 [ 332.099943] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.100328] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.100706] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.101089] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.101468] ? wp_page_reuse+0x8e/0xb0 [ 332.101779] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.102163] ? do_wp_page+0xe6/0x470 [ 332.102465] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.102843] ? __handle_mm_fault+0x5ff/0x720 [ 332.103197] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.103574] ? __count_memcg_events+0x4d/0xd0 [ 332.103938] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.104323] ? count_memcg_events.constprop.0+0x26/0x50 [ 332.104729] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.105114] ? handle_mm_fault+0xae/0x320 [ 332.105442] ? srso_alias_return_thunk+0x5/0xfbef5 [ 332.105820] ? do_user_addr_fault+0x31f/0x6c0 [ 332.106181] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 332.106576] RIP: 0033:0x7ff43f8f9a73 [ 332.106876] Code: db 0f 85 28 01 00 00 64 4c 8b 0c 25 10 00 00 00 45 31 c0 4d 8d 91 d0 02 00 00 31 d2 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 b9 00 00 00 41 89 c5 85 c0 0f 85 c6 00 00 [ 332.108163] RSP: 002b:7ffc690909b0 EFLAGS: 0246 ORIG_RAX: 0038 [ 332.108719] RAX: ffda RBX: RCX: 7ff43f8f9a73 [ 332.109253] RDX: RSI: RDI: 01200011 [ 332.109782] RBP: R08: R09: 7ff443899980 [ 332.110313] R10: 7ff443899c50 R11: 0246 R12: 0002 [ 332.110842] R13: 562e56cd4780 R14:
Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
On Wed, Apr 07, 2021 at 02:34:01PM +, Haiyang Zhang wrote: > > > > -Original Message- > > From: Wei Liu > > Sent: Wednesday, April 7, 2021 9:17 AM > > To: Dexuan Cui > > Cc: da...@davemloft.net; k...@kernel.org; KY Srinivasan > > ; Haiyang Zhang ; Stephen > > Hemminger ; wei@kernel.org; Wei Liu > > ; netdev@vger.kernel.org; linux- > > ker...@vger.kernel.org; linux-hyp...@vger.kernel.org > > Subject: Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure > > Network Adapter (MANA) > > > > On Tue, Apr 06, 2021 at 04:23:21PM -0700, Dexuan Cui wrote: > > [...] > > > +config MICROSOFT_MANA > > > + tristate "Microsoft Azure Network Adapter (MANA) support" > > > + default m > > > + depends on PCI_MSI > > > + select PCI_HYPERV > > > > OOI which part of the code requires PCI_HYPERV? > > > > Asking because I can't immediately find code that looks to be Hyper-V > > specific (searching for vmbus etc). This device looks like any other PCI > > devices > > to me. > > It depends on the VF nic's PCI config space which is presented by the > pci_hyperv driver. I think all it matters is the PCI bus is able to handle the configuration space access, right? Assuming there is an emulated PCI root complex which exposes the config space to the driver, will this driver still work? I'm trying to understand how tightly coupled with Hyper-V PCI this driver is. In an alternative universe, Microsft may suddenly decide to sell this hardware and someone wants to passthrough an VF via VFIO. I don't see how this driver wouldn't work, hence the original question. There is no need to change the code. I'm just curious about a tiny detail in the implementation. Wei. > > Thanks, > - Haiyang
Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
On Tue, Apr 06, 2021 at 04:23:21PM -0700, Dexuan Cui wrote: [...] > +config MICROSOFT_MANA > + tristate "Microsoft Azure Network Adapter (MANA) support" > + default m > + depends on PCI_MSI > + select PCI_HYPERV OOI which part of the code requires PCI_HYPERV? Asking because I can't immediately find code that looks to be Hyper-V specific (searching for vmbus etc). This device looks like any other PCI devices to me. Wei.
Re: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)
On Wed, Apr 07, 2021 at 08:08:59AM +, Dexuan Cui wrote: > > From: kernel test robot > > Sent: Tuesday, April 6, 2021 6:31 PM > > ... > > Hi Dexuan, > > I love your patch! Perhaps something to improve: > > > > All warnings (new ones prefixed by >>): > > > >drivers/pci/controller/pci-hyperv.c: In function 'hv_irq_unmask': > >drivers/pci/controller/pci-hyperv.c:1220:2: error: implicit declaration > > of > > function 'hv_set_msi_entry_from_desc' > > [-Werror=implicit-function-declaration] > > 1220 | hv_set_msi_entry_from_desc(¶ms->int_entry.msi_entry, > > msi_desc); > > This build error looks strange, because the patch doesn't even touch the > driver > drivers/pci/controller/pci-hyperv.c. > I think this is normal. The bot doesn't restrict itself to the changed code from my experience. Wei.
Re: [patch 12/14] PCI: hv: Use tasklet_disable_in_atomic()
On Tue, Mar 09, 2021 at 09:42:15AM +0100, Thomas Gleixner wrote: > From: Sebastian Andrzej Siewior > > The hv_compose_msi_msg() callback in irq_chip::irq_compose_msi_msg is > invoked via irq_chip_compose_msi_msg(), which itself is always invoked from > atomic contexts from the guts of the interrupt core code. > > There is no way to change this w/o rewriting the whole driver, so use > tasklet_disable_in_atomic() which allows to make tasklet_disable() > sleepable once the remaining atomic users are addressed. > > Signed-off-by: Sebastian Andrzej Siewior > Signed-off-by: Thomas Gleixner Acked-by: Wei Liu
Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero
On 2/28/2021 1:27 PM, Michael S. Tsirkin wrote: On Thu, Feb 25, 2021 at 04:56:42PM -0800, Si-Wei Liu wrote: Hi Michael, Are you okay to live without this ioctl for now? I think QEMU is the one that needs to be fixed and will have to be made legacy guest aware. I think the kernel can just honor the feature negotiation result done by QEMU and do as what's told to.Will you agree? If it's fine, I would proceed to reverting commit fe36cbe067 and related code in question from the kernel. Thanks, -Siwei Not really, I don't see why that's a good idea. fe36cbe067 is the code checking MTU before FEATURES_OK. Spec explicitly allows that. Alright, but what I meant was this commit 452639a64ad8 ("vdpa: make sure set_features is invoked for legacy"). But I got why you need it in another email (for BE host/guest). -Siwei
Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero
Hi Michael, Are you okay to live without this ioctl for now? I think QEMU is the one that needs to be fixed and will have to be made legacy guest aware. I think the kernel can just honor the feature negotiation result done by QEMU and do as what's told to.Will you agree? If it's fine, I would proceed to reverting commit fe36cbe067 and related code in question from the kernel. Thanks, -Siwei On 2/24/2021 10:24 AM, Si-Wei Liu wrote: Detecting it isn't enough though, we will need a new ioctl to notify the kernel that it's a legacy guest. Ugh :( Well, although I think adding an ioctl is doable, may I know what the use case there will be for kernel to leverage such info directly? Is there a case QEMU can't do with dedicate ioctls later if there's indeed differentiation (legacy v.s. modern) needed? One of the reason I asked is if this ioctl becomes a mandate for vhost-vdpa kernel. QEMU would reject initialize vhost-vdpa if doesn't see this ioctl coming? If it's optional, suppose the kernel may need it only when it becomes necessary? Thanks, -Siwei
Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero
On 2/23/2021 9:04 PM, Michael S. Tsirkin wrote: On Tue, Feb 23, 2021 at 11:35:57AM -0800, Si-Wei Liu wrote: On 2/23/2021 5:26 AM, Michael S. Tsirkin wrote: On Tue, Feb 23, 2021 at 10:03:57AM +0800, Jason Wang wrote: On 2021/2/23 9:12 上午, Si-Wei Liu wrote: On 2/21/2021 11:34 PM, Michael S. Tsirkin wrote: On Mon, Feb 22, 2021 at 12:14:17PM +0800, Jason Wang wrote: On 2021/2/19 7:54 下午, Si-Wei Liu wrote: Commit 452639a64ad8 ("vdpa: make sure set_features is invoked for legacy") made an exception for legacy guests to reset features to 0, when config space is accessed before features are set. We should relieve the verify_min_features() check and allow features reset to 0 for this case. It's worth noting that not just legacy guests could access config space before features are set. For instance, when feature VIRTIO_NET_F_MTU is advertised some modern driver will try to access and validate the MTU present in the config space before virtio features are set. This looks like a spec violation: " The following driver-read-only field, mtu only exists if VIRTIO_NET_F_MTU is set. This field specifies the maximum MTU for the driver to use. " Do we really want to workaround this? Thanks And also: The driver MUST follow this sequence to initialize a device: 1. Reset the device. 2. Set the ACKNOWLEDGE status bit: the guest OS has noticed the device. 3. Set the DRIVER status bit: the guest OS knows how to drive the device. 4. Read device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it. 5. Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits after this step. 6. Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, the device does not support our subset of features and the device is unusable. 7. Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues. 8. Set the DRIVER_OK status bit. At this point the device is “live”. so accessing config space before FEATURES_OK is a spec violation, right? It is, but it's not relevant to what this commit tries to address. I thought the legacy guest still needs to be supported. Having said, a separate patch has to be posted to fix the guest driver issue where this discrepancy is introduced to virtnet_validate() (since commit fe36cbe067). But it's not technically related to this patch. -Siwei I think it's a bug to read config space in validate, we should move it to virtnet_probe(). Thanks I take it back, reading but not writing seems to be explicitly allowed by spec. So our way to detect a legacy guest is bogus, need to think what is the best way to handle this. Then maybe revert commit fe36cbe067 and friends, and have QEMU detect legacy guest? Supposedly only config space write access needs to be guarded before setting FEATURES_OK. -Siwie Detecting it isn't enough though, we will need a new ioctl to notify the kernel that it's a legacy guest. Ugh :( Well, although I think adding an ioctl is doable, may I know what the use case there will be for kernel to leverage such info directly? Is there a case QEMU can't do with dedicate ioctls later if there's indeed differentiation (legacy v.s. modern) needed? One of the reason I asked is if this ioctl becomes a mandate for vhost-vdpa kernel. QEMU would reject initialize vhost-vdpa if doesn't see this ioctl coming? If it's optional, suppose the kernel may need it only when it becomes necessary? Thanks, -Siwei Rejecting reset to 0 prematurely causes correct MTU and link status unable to load for the very first config space access, rendering issues like guest showing inaccurate MTU value, or failure to reject out-of-range MTU. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 7c1f789..540dd67 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1490,14 +1490,6 @@ static u64 mlx5_vdpa_get_features(struct vdpa_device *vdev) return mvdev->mlx_features; } -static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features) -{ - if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM))) - return -EOPNOTSUPP; - - return 0; -} - static int setup_virtqueues(struct mlx5_vdpa_net *ndev) { int err; @@ -1558,18 +1550,13 @@ static int mlx5_vdpa_set_features(struct vdpa_device *vdev, u64
Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero
On 2/23/2021 5:26 AM, Michael S. Tsirkin wrote: On Tue, Feb 23, 2021 at 10:03:57AM +0800, Jason Wang wrote: On 2021/2/23 9:12 上午, Si-Wei Liu wrote: On 2/21/2021 11:34 PM, Michael S. Tsirkin wrote: On Mon, Feb 22, 2021 at 12:14:17PM +0800, Jason Wang wrote: On 2021/2/19 7:54 下午, Si-Wei Liu wrote: Commit 452639a64ad8 ("vdpa: make sure set_features is invoked for legacy") made an exception for legacy guests to reset features to 0, when config space is accessed before features are set. We should relieve the verify_min_features() check and allow features reset to 0 for this case. It's worth noting that not just legacy guests could access config space before features are set. For instance, when feature VIRTIO_NET_F_MTU is advertised some modern driver will try to access and validate the MTU present in the config space before virtio features are set. This looks like a spec violation: " The following driver-read-only field, mtu only exists if VIRTIO_NET_F_MTU is set. This field specifies the maximum MTU for the driver to use. " Do we really want to workaround this? Thanks And also: The driver MUST follow this sequence to initialize a device: 1. Reset the device. 2. Set the ACKNOWLEDGE status bit: the guest OS has noticed the device. 3. Set the DRIVER status bit: the guest OS knows how to drive the device. 4. Read device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it. 5. Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits after this step. 6. Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, the device does not support our subset of features and the device is unusable. 7. Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues. 8. Set the DRIVER_OK status bit. At this point the device is “live”. so accessing config space before FEATURES_OK is a spec violation, right? It is, but it's not relevant to what this commit tries to address. I thought the legacy guest still needs to be supported. Having said, a separate patch has to be posted to fix the guest driver issue where this discrepancy is introduced to virtnet_validate() (since commit fe36cbe067). But it's not technically related to this patch. -Siwei I think it's a bug to read config space in validate, we should move it to virtnet_probe(). Thanks I take it back, reading but not writing seems to be explicitly allowed by spec. So our way to detect a legacy guest is bogus, need to think what is the best way to handle this. Then maybe revert commit fe36cbe067 and friends, and have QEMU detect legacy guest? Supposedly only config space write access needs to be guarded before setting FEATURES_OK. -Siwie Rejecting reset to 0 prematurely causes correct MTU and link status unable to load for the very first config space access, rendering issues like guest showing inaccurate MTU value, or failure to reject out-of-range MTU. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 7c1f789..540dd67 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1490,14 +1490,6 @@ static u64 mlx5_vdpa_get_features(struct vdpa_device *vdev) return mvdev->mlx_features; } -static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features) -{ - if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM))) - return -EOPNOTSUPP; - - return 0; -} - static int setup_virtqueues(struct mlx5_vdpa_net *ndev) { int err; @@ -1558,18 +1550,13 @@ static int mlx5_vdpa_set_features(struct vdpa_device *vdev, u64 features) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); - int err; print_features(mvdev, features, true); - err = verify_min_features(mvdev, features); - if (err) - return err; - ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features; ndev->config.mtu = cpu_to_mlx5vdpa16(mvdev, ndev->mtu); ndev->config.status |= cpu_to_mlx5vdpa16(mvdev, VIRTIO_NET_S_LINK_UP); - return err; + return 0; } static void mlx5_vdpa_set_config_cb(struct vdpa_device *vdev, struct vdpa_callback *cb)
Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero
On 2/21/2021 11:34 PM, Michael S. Tsirkin wrote: On Mon, Feb 22, 2021 at 12:14:17PM +0800, Jason Wang wrote: On 2021/2/19 7:54 下午, Si-Wei Liu wrote: Commit 452639a64ad8 ("vdpa: make sure set_features is invoked for legacy") made an exception for legacy guests to reset features to 0, when config space is accessed before features are set. We should relieve the verify_min_features() check and allow features reset to 0 for this case. It's worth noting that not just legacy guests could access config space before features are set. For instance, when feature VIRTIO_NET_F_MTU is advertised some modern driver will try to access and validate the MTU present in the config space before virtio features are set. This looks like a spec violation: " The following driver-read-only field, mtu only exists if VIRTIO_NET_F_MTU is set. This field specifies the maximum MTU for the driver to use. " Do we really want to workaround this? Thanks And also: The driver MUST follow this sequence to initialize a device: 1. Reset the device. 2. Set the ACKNOWLEDGE status bit: the guest OS has noticed the device. 3. Set the DRIVER status bit: the guest OS knows how to drive the device. 4. Read device feature bits, and write the subset of feature bits understood by the OS and driver to the device. During this step the driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it. 5. Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits after this step. 6. Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, the device does not support our subset of features and the device is unusable. 7. Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues. 8. Set the DRIVER_OK status bit. At this point the device is “live”. so accessing config space before FEATURES_OK is a spec violation, right? It is, but it's not relevant to what this commit tries to address. I thought the legacy guest still needs to be supported. Having said, a separate patch has to be posted to fix the guest driver issue where this discrepancy is introduced to virtnet_validate() (since commit fe36cbe067). But it's not technically related to this patch. -Siwei Rejecting reset to 0 prematurely causes correct MTU and link status unable to load for the very first config space access, rendering issues like guest showing inaccurate MTU value, or failure to reject out-of-range MTU. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 7c1f789..540dd67 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1490,14 +1490,6 @@ static u64 mlx5_vdpa_get_features(struct vdpa_device *vdev) return mvdev->mlx_features; } -static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features) -{ - if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM))) - return -EOPNOTSUPP; - - return 0; -} - static int setup_virtqueues(struct mlx5_vdpa_net *ndev) { int err; @@ -1558,18 +1550,13 @@ static int mlx5_vdpa_set_features(struct vdpa_device *vdev, u64 features) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); - int err; print_features(mvdev, features, true); - err = verify_min_features(mvdev, features); - if (err) - return err; - ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features; ndev->config.mtu = cpu_to_mlx5vdpa16(mvdev, ndev->mtu); ndev->config.status |= cpu_to_mlx5vdpa16(mvdev, VIRTIO_NET_S_LINK_UP); - return err; + return 0; } static void mlx5_vdpa_set_config_cb(struct vdpa_device *vdev, struct vdpa_callback *cb)
Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero
On 2/21/2021 8:14 PM, Jason Wang wrote: On 2021/2/19 7:54 下午, Si-Wei Liu wrote: Commit 452639a64ad8 ("vdpa: make sure set_features is invoked for legacy") made an exception for legacy guests to reset features to 0, when config space is accessed before features are set. We should relieve the verify_min_features() check and allow features reset to 0 for this case. It's worth noting that not just legacy guests could access config space before features are set. For instance, when feature VIRTIO_NET_F_MTU is advertised some modern driver will try to access and validate the MTU present in the config space before virtio features are set. This looks like a spec violation: " The following driver-read-only field, mtu only exists if VIRTIO_NET_F_MTU is set. This field specifies the maximum MTU for the driver to use. " Do we really want to workaround this? Isn't the commit 452639a64ad8 itself is a workaround for legacy guest? I think the point is, since there's legacy guest we'd have to support, this host side workaround is unavoidable. Although I agree the violating driver should be fixed (yes, it's in today's upstream kernel which exists for a while now). -Siwei Thanks Rejecting reset to 0 prematurely causes correct MTU and link status unable to load for the very first config space access, rendering issues like guest showing inaccurate MTU value, or failure to reject out-of-range MTU. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 7c1f789..540dd67 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1490,14 +1490,6 @@ static u64 mlx5_vdpa_get_features(struct vdpa_device *vdev) return mvdev->mlx_features; } -static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features) -{ - if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM))) - return -EOPNOTSUPP; - - return 0; -} - static int setup_virtqueues(struct mlx5_vdpa_net *ndev) { int err; @@ -1558,18 +1550,13 @@ static int mlx5_vdpa_set_features(struct vdpa_device *vdev, u64 features) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); - int err; print_features(mvdev, features, true); - err = verify_min_features(mvdev, features); - if (err) - return err; - ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features; ndev->config.mtu = cpu_to_mlx5vdpa16(mvdev, ndev->mtu); ndev->config.status |= cpu_to_mlx5vdpa16(mvdev, VIRTIO_NET_S_LINK_UP); - return err; + return 0; } static void mlx5_vdpa_set_config_cb(struct vdpa_device *vdev, struct vdpa_callback *cb)
Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map
On 2/19/2021 6:38 PM, Jason Wang wrote: Right now the value is exposed to userspace via GET_VRING_BASE, so only last_avail_idx is synced. If we need sync last_used_idx, we should also sync pending indices which requires more thoughts. Technically it doesn't sound right - crossing the boundary a bit even with simplified form of assumption. But depending on how userspace could make use of this API, it doesn't seem it breaks existing functionality for the moment. I don't get here, maybe you can explain a little bit more? Please refer to the email I just sent. https://lore.kernel.org/lkml/033b0806-4037-5755-a1fa-91dbb58ba...@oracle.com/ -Siwei
Re: [PATCH 1/2] vdpa/mlx5: Fix suspend/resume index restoration
On 2/17/2021 11:42 AM, Si-Wei Liu wrote: On 2/16/2021 8:20 AM, Eli Cohen wrote: When we suspend the VM, the VDPA interface will be reset. When the VM is resumed again, clear_virtqueues() will clear the available and used indices resulting in hardware virqtqueue objects becoming out of sync. We can avoid this function alltogether since qemu will clear them if required, e.g. when the VM went through a reboot. Moreover, since the hw available and used indices should always be identical on query and should be restored to the same value same value for virtqueues that complete in order, we set the single value provided by set_vq_state(). In get_vq_state() we return the value of hardware used index. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen Acked-by: Si-Wei Liu I'd take it back for the moment, according to Jason's latest comment. Discussion still going. --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 17 - 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index b8e9d525d66c..a51b0f86afe2 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1169,6 +1169,7 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m return; } mvq->avail_idx = attr.available_index; + mvq->used_idx = attr.used_index; } static void suspend_vqs(struct mlx5_vdpa_net *ndev) @@ -1426,6 +1427,7 @@ static int mlx5_vdpa_set_vq_state(struct vdpa_device *vdev, u16 idx, return -EINVAL; } + mvq->used_idx = state->avail_index; mvq->avail_idx = state->avail_index; This is where things starts getting interesting. According to Jason, the original expectation of this API would be to restore the internal last_avail_index in the hardware. With Mellanox network vDPA hardware implementation, there's no such last_avail_index thing in the hardware but only a single last_used_index representing both indices, which should always be the same and in sync with each other. So from migration point of view, it appears logical to restore the saved last_avail_index to the last_used_index in the hardware, right? But what is the point to restore mvq->avail_idx? Actually, this code path is being repurposed to address the index reset problem in the device reset scenario. Where the mvq->avail_idx and mvq->used_idx are both reset to 0 once device is reset. This is a bit crossing the boundary to me. return 0; } @@ -1443,7 +1445,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device *vdev, u16 idx, struct vdpa * that cares about emulating the index after vq is stopped. */ if (!mvq->initialized) { - state->avail_index = mvq->avail_idx; + state->avail_index = mvq->used_idx; This is where the last_used_index gets loaded from the hardware (when device is stopped). return 0; } @@ -1452,7 +1454,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device *vdev, u16 idx, struct vdpa mlx5_vdpa_warn(mvdev, "failed to query virtqueue\n"); return err; } - state->avail_index = attr.available_index; + state->avail_index = attr.used_index; This code path never gets called from userspace (when device is running). -Siwei return 0; } @@ -1532,16 +1534,6 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } -static void clear_virtqueues(struct mlx5_vdpa_net *ndev) -{ - int i; - - for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { - ndev->vqs[i].avail_idx = 0; - ndev->vqs[i].used_idx = 0; - } -} - /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1777,7 +1769,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); - clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ++mvdev->generation;
Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map
On 2/18/2021 7:10 PM, Jason Wang wrote: On 2021/2/18 8:43 下午, Si-Wei Liu wrote: On 2/17/2021 8:44 PM, Jason Wang wrote: On 2021/2/10 下午4:59, Si-Wei Liu wrote: On 2/9/2021 7:53 PM, Jason Wang wrote: On 2021/2/10 上午10:30, Si-Wei Liu wrote: On 2/8/2021 10:37 PM, Jason Wang wrote: On 2021/2/9 下午2:12, Eli Cohen wrote: On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote: On 2021/2/8 下午6:04, Eli Cohen wrote: On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote: On 2021/2/8 下午2:37, Eli Cohen wrote: On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote: On 2021/2/6 上午7:07, Si-Wei Liu wrote: On 2/3/2021 11:36 PM, Eli Cohen wrote: When a change of memory map occurs, the hardware resources are destroyed and then re-created again with the new memory map. In such case, we need to restore the hardware available and used indices. The driver failed to restore the used index which is added here. Also, since the driver also fails to reset the available and used indices upon device reset, fix this here to avoid regression caused by the fact that used index may not be zero upon device reset. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen --- v0 -> v1: Clear indices upon device reset drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 88dde3455bfd..b5fe6d2ad22f 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info { u64 device_addr; u64 driver_addr; u16 avail_index; + u16 used_index; bool ready; struct vdpa_callback cb; bool restore; @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue { u32 virtq_id; struct mlx5_vdpa_net *ndev; u16 avail_idx; + u16 used_idx; int fw_state; /* keep last in the struct */ @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtque obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context); MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, mvq->avail_idx); + MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, mvq->used_idx); MLX5_SET(virtio_net_q_object, obj_context, queue_feature_bit_mask_12_3, get_features_12_3(ndev->mvdev.actual_features)); vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, virtio_q_context); @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m struct mlx5_virtq_attr { u8 state; u16 available_index; + u16 used_index; }; static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq, @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu memset(attr, 0, sizeof(*attr)); attr->state = MLX5_GET(virtio_net_q_object, obj_context, state); attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, hw_available_index); + attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, hw_used_index); kfree(out); return 0; @@ -1535,6 +1540,16 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } +static void clear_virtqueues(struct mlx5_vdpa_net *ndev) +{ + int i; + + for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { + ndev->vqs[i].avail_idx = 0; + ndev->vqs[i].used_idx = 0; + } +} + /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1610,6 +1625,7 @@ static int save_channel_info(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu return err; ri->avail_index = attr.available_index; + ri->used_index = attr.used_index; ri->ready = mvq->ready; ri->num_ent = mvq->num_ent; ri->desc_addr = mvq->desc_addr; @@ -1654,6 +1670,7 @@ static void restore_channels_info(struct mlx5_vdpa_net *ndev) continue; mvq->avail_idx = ri->avail_index; + mvq->used_idx = ri->used_index; mvq->ready = ri->ready; mvq->num_ent = ri->num_ent; mvq->desc_addr = ri->desc_addr; @@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); + clear_virtqueues(ndev); The clearing looks fine at the first glance, as it aligns with the other state cleanups floating around at the same place. However, the thing is get_vq_st
[PATCH] vdpa/mlx5: set_features should allow reset to zero
Commit 452639a64ad8 ("vdpa: make sure set_features is invoked for legacy") made an exception for legacy guests to reset features to 0, when config space is accessed before features are set. We should relieve the verify_min_features() check and allow features reset to 0 for this case. It's worth noting that not just legacy guests could access config space before features are set. For instance, when feature VIRTIO_NET_F_MTU is advertised some modern driver will try to access and validate the MTU present in the config space before virtio features are set. Rejecting reset to 0 prematurely causes correct MTU and link status unable to load for the very first config space access, rendering issues like guest showing inaccurate MTU value, or failure to reject out-of-range MTU. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 7c1f789..540dd67 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1490,14 +1490,6 @@ static u64 mlx5_vdpa_get_features(struct vdpa_device *vdev) return mvdev->mlx_features; } -static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features) -{ - if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM))) - return -EOPNOTSUPP; - - return 0; -} - static int setup_virtqueues(struct mlx5_vdpa_net *ndev) { int err; @@ -1558,18 +1550,13 @@ static int mlx5_vdpa_set_features(struct vdpa_device *vdev, u64 features) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); - int err; print_features(mvdev, features, true); - err = verify_min_features(mvdev, features); - if (err) - return err; - ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features; ndev->config.mtu = cpu_to_mlx5vdpa16(mvdev, ndev->mtu); ndev->config.status |= cpu_to_mlx5vdpa16(mvdev, VIRTIO_NET_S_LINK_UP); - return err; + return 0; } static void mlx5_vdpa_set_config_cb(struct vdpa_device *vdev, struct vdpa_callback *cb) -- 1.8.3.1
Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map
On 2/17/2021 8:44 PM, Jason Wang wrote: On 2021/2/10 下午4:59, Si-Wei Liu wrote: On 2/9/2021 7:53 PM, Jason Wang wrote: On 2021/2/10 上午10:30, Si-Wei Liu wrote: On 2/8/2021 10:37 PM, Jason Wang wrote: On 2021/2/9 下午2:12, Eli Cohen wrote: On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote: On 2021/2/8 下午6:04, Eli Cohen wrote: On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote: On 2021/2/8 下午2:37, Eli Cohen wrote: On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote: On 2021/2/6 上午7:07, Si-Wei Liu wrote: On 2/3/2021 11:36 PM, Eli Cohen wrote: When a change of memory map occurs, the hardware resources are destroyed and then re-created again with the new memory map. In such case, we need to restore the hardware available and used indices. The driver failed to restore the used index which is added here. Also, since the driver also fails to reset the available and used indices upon device reset, fix this here to avoid regression caused by the fact that used index may not be zero upon device reset. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen --- v0 -> v1: Clear indices upon device reset drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 88dde3455bfd..b5fe6d2ad22f 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info { u64 device_addr; u64 driver_addr; u16 avail_index; + u16 used_index; bool ready; struct vdpa_callback cb; bool restore; @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue { u32 virtq_id; struct mlx5_vdpa_net *ndev; u16 avail_idx; + u16 used_idx; int fw_state; /* keep last in the struct */ @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtque obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context); MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, mvq->avail_idx); + MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, mvq->used_idx); MLX5_SET(virtio_net_q_object, obj_context, queue_feature_bit_mask_12_3, get_features_12_3(ndev->mvdev.actual_features)); vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, virtio_q_context); @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m struct mlx5_virtq_attr { u8 state; u16 available_index; + u16 used_index; }; static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq, @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu memset(attr, 0, sizeof(*attr)); attr->state = MLX5_GET(virtio_net_q_object, obj_context, state); attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, hw_available_index); + attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, hw_used_index); kfree(out); return 0; @@ -1535,6 +1540,16 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } +static void clear_virtqueues(struct mlx5_vdpa_net *ndev) +{ + int i; + + for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { + ndev->vqs[i].avail_idx = 0; + ndev->vqs[i].used_idx = 0; + } +} + /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1610,6 +1625,7 @@ static int save_channel_info(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu return err; ri->avail_index = attr.available_index; + ri->used_index = attr.used_index; ri->ready = mvq->ready; ri->num_ent = mvq->num_ent; ri->desc_addr = mvq->desc_addr; @@ -1654,6 +1670,7 @@ static void restore_channels_info(struct mlx5_vdpa_net *ndev) continue; mvq->avail_idx = ri->avail_index; + mvq->used_idx = ri->used_index; mvq->ready = ri->ready; mvq->num_ent = ri->num_ent; mvq->desc_addr = ri->desc_addr; @@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); + clear_virtqueues(ndev); The clearing looks fine at the first glance, as it aligns with the other state cleanups floating around at the same place. However, the thing is get_vq_state() is supposed to be called right after to get sync'ed with the latest internal
Re: [PATCH v2 3/3] vdpa/mlx5: defer clear_virtqueues to until DRIVER_OK
On 2/16/2021 7:21 AM, Eli Cohen wrote: On Thu, Feb 11, 2021 at 09:33:14AM +0200, Eli Cohen wrote: On Wed, Feb 10, 2021 at 01:48:00PM -0800, Si-Wei Liu wrote: While virtq is stopped, get_vq_state() is supposed to be called to get sync'ed with the latest internal avail_index from device. The saved avail_index is used to restate the virtq once device is started. Commit b35ccebe3ef7 introduced the clear_virtqueues() routine to reset the saved avail_index, however, the index gets cleared a bit earlier before get_vq_state() tries to read it. This would cause consistency problems when virtq is restarted, e.g. through a series of link down and link up events. We could defer the clearing of avail_index to until the device is to be started, i.e. until VIRTIO_CONFIG_S_DRIVER_OK is set again in set_status(). Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change map") Signed-off-by: Si-Wei Liu Acked-by: Jason Wang Acked-by: Eli Cohen I take it back. I think we don't need to clear the indexes at all. In case we need to restore indexes we'll get the right values through set_vq_state(). If we suspend the virtqueue due to VM being suspended, qemu will query first and will provide the the queried value. In case of VM reboot, it will provide 0 in set_vq_state(). I am sending a patch that addresses both reboot and suspend. With set_vq_state() repurposed to restoring used_index I'm fine with this approach. Do I have to repost a v3 of this series while dropping the 3rd patch? -Siwei --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 7c1f789..ce6aae8 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1777,7 +1777,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); - clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ++mvdev->generation; @@ -1786,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) { if (status & VIRTIO_CONFIG_S_DRIVER_OK) { + clear_virtqueues(ndev); err = setup_driver(ndev); if (err) { mlx5_vdpa_warn(mvdev, "failed to setup driver\n"); -- 1.8.3.1
Re: [PATCH 1/2] vdpa/mlx5: Fix suspend/resume index restoration
On 2/17/2021 1:20 PM, Michael S. Tsirkin wrote: On Wed, Feb 17, 2021 at 11:42:48AM -0800, Si-Wei Liu wrote: On 2/16/2021 8:20 AM, Eli Cohen wrote: When we suspend the VM, the VDPA interface will be reset. When the VM is resumed again, clear_virtqueues() will clear the available and used indices resulting in hardware virqtqueue objects becoming out of sync. We can avoid this function alltogether since qemu will clear them if required, e.g. when the VM went through a reboot. Moreover, since the hw available and used indices should always be identical on query and should be restored to the same value same value for virtqueues that complete in order, we set the single value provided by set_vq_state(). In get_vq_state() we return the value of hardware used index. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen Acked-by: Si-Wei Liu Seems to also fix b35ccebe3ef76168aa2edaa35809c0232cb3578e, right? I think so. It should have both "Fixes" tags. -Siwei --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 17 - 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index b8e9d525d66c..a51b0f86afe2 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1169,6 +1169,7 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m return; } mvq->avail_idx = attr.available_index; + mvq->used_idx = attr.used_index; } static void suspend_vqs(struct mlx5_vdpa_net *ndev) @@ -1426,6 +1427,7 @@ static int mlx5_vdpa_set_vq_state(struct vdpa_device *vdev, u16 idx, return -EINVAL; } + mvq->used_idx = state->avail_index; mvq->avail_idx = state->avail_index; return 0; } @@ -1443,7 +1445,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device *vdev, u16 idx, struct vdpa * that cares about emulating the index after vq is stopped. */ if (!mvq->initialized) { - state->avail_index = mvq->avail_idx; + state->avail_index = mvq->used_idx; return 0; } @@ -1452,7 +1454,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device *vdev, u16 idx, struct vdpa mlx5_vdpa_warn(mvdev, "failed to query virtqueue\n"); return err; } - state->avail_index = attr.available_index; + state->avail_index = attr.used_index; return 0; } @@ -1532,16 +1534,6 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } -static void clear_virtqueues(struct mlx5_vdpa_net *ndev) -{ - int i; - - for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { - ndev->vqs[i].avail_idx = 0; - ndev->vqs[i].used_idx = 0; - } -} - /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1777,7 +1769,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); - clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ++mvdev->generation;
Re: [PATCH 1/2] vdpa/mlx5: Fix suspend/resume index restoration
On 2/16/2021 8:20 AM, Eli Cohen wrote: When we suspend the VM, the VDPA interface will be reset. When the VM is resumed again, clear_virtqueues() will clear the available and used indices resulting in hardware virqtqueue objects becoming out of sync. We can avoid this function alltogether since qemu will clear them if required, e.g. when the VM went through a reboot. Moreover, since the hw available and used indices should always be identical on query and should be restored to the same value same value for virtqueues that complete in order, we set the single value provided by set_vq_state(). In get_vq_state() we return the value of hardware used index. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen Acked-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 17 - 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index b8e9d525d66c..a51b0f86afe2 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1169,6 +1169,7 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m return; } mvq->avail_idx = attr.available_index; + mvq->used_idx = attr.used_index; } static void suspend_vqs(struct mlx5_vdpa_net *ndev) @@ -1426,6 +1427,7 @@ static int mlx5_vdpa_set_vq_state(struct vdpa_device *vdev, u16 idx, return -EINVAL; } + mvq->used_idx = state->avail_index; mvq->avail_idx = state->avail_index; return 0; } @@ -1443,7 +1445,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device *vdev, u16 idx, struct vdpa * that cares about emulating the index after vq is stopped. */ if (!mvq->initialized) { - state->avail_index = mvq->avail_idx; + state->avail_index = mvq->used_idx; return 0; } @@ -1452,7 +1454,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device *vdev, u16 idx, struct vdpa mlx5_vdpa_warn(mvdev, "failed to query virtqueue\n"); return err; } - state->avail_index = attr.available_index; + state->avail_index = attr.used_index; return 0; } @@ -1532,16 +1534,6 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } -static void clear_virtqueues(struct mlx5_vdpa_net *ndev) -{ - int i; - - for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { - ndev->vqs[i].avail_idx = 0; - ndev->vqs[i].used_idx = 0; - } -} - /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1777,7 +1769,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); - clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ++mvdev->generation;
Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map
On 2/10/2021 7:45 AM, Eli Cohen wrote: On Wed, Feb 10, 2021 at 12:59:03AM -0800, Si-Wei Liu wrote: On 2/9/2021 7:53 PM, Jason Wang wrote: On 2021/2/10 上午10:30, Si-Wei Liu wrote: On 2/8/2021 10:37 PM, Jason Wang wrote: On 2021/2/9 下午2:12, Eli Cohen wrote: On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote: On 2021/2/8 下午6:04, Eli Cohen wrote: On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote: On 2021/2/8 下午2:37, Eli Cohen wrote: On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote: On 2021/2/6 上午7:07, Si-Wei Liu wrote: On 2/3/2021 11:36 PM, Eli Cohen wrote: When a change of memory map occurs, the hardware resources are destroyed and then re-created again with the new memory map. In such case, we need to restore the hardware available and used indices. The driver failed to restore the used index which is added here. Also, since the driver also fails to reset the available and used indices upon device reset, fix this here to avoid regression caused by the fact that used index may not be zero upon device reset. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen --- v0 -> v1: Clear indices upon device reset drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 88dde3455bfd..b5fe6d2ad22f 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info { u64 device_addr; u64 driver_addr; u16 avail_index; + u16 used_index; bool ready; struct vdpa_callback cb; bool restore; @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue { u32 virtq_id; struct mlx5_vdpa_net *ndev; u16 avail_idx; + u16 used_idx; int fw_state; /* keep last in the struct */ @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtque obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context); MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, mvq->avail_idx); + MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, mvq->used_idx); MLX5_SET(virtio_net_q_object, obj_context, queue_feature_bit_mask_12_3, get_features_12_3(ndev->mvdev.actual_features)); vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, virtio_q_context); @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m struct mlx5_virtq_attr { u8 state; u16 available_index; + u16 used_index; }; static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq, @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu memset(attr, 0, sizeof(*attr)); attr->state = MLX5_GET(virtio_net_q_object, obj_context, state); attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, hw_available_index); + attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, hw_used_index); kfree(out); return 0; @@ -1535,6 +1540,16 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } +static void clear_virtqueues(struct mlx5_vdpa_net *ndev) +{ + int i; + + for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { + ndev->vqs[i].avail_idx = 0; + ndev->vqs[i].used_idx = 0; + } +} + /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1610,6 +1625,7 @@ static int save_channel_info(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu return err; ri->avail_index = attr.available_index; + ri->used_index = attr.used_index; ri->ready = mvq->ready; ri->num_ent = mvq->num_ent; ri->desc_addr = mvq->desc_addr; @@ -1654,6 +1670,7 @@ static void restore_channels_info(struct mlx5_vdpa_net *ndev) continue; mvq->avail_idx = ri->avail_index; + mvq->used_idx = ri->used_index; mvq->ready = ri->ready; mvq->num_ent = ri->num_ent; mvq->desc_addr = ri->desc_addr; @@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); + clear_virtqueues(ndev); The clearing looks fine at the first glance, as it aligns with the other state cleanups floating around at the same place. However, the thing is get_vq_state() is supposed to be called right after to get sync'ed
Re: Regressions with VMBus/VSCs hardening changes
On Fri, Feb 12, 2021 at 05:50:50PM +0100, Andrea Parri wrote: > Hi all, [...] > 2) IIUC a8c3209998afb5 could be dropped (after rebase) without further modi- >fications to hyperv-next. I've reverted the said patch from hyperv-next. Wei.
Re: [PATCH v2 4/8] xen/netback: fix spurious event detection for common event case
On Thu, Feb 11, 2021 at 11:16:12AM +0100, Juergen Gross wrote: > In case of a common event for rx and tx queue the event should be > regarded to be spurious if no rx and no tx requests are pending. > > Unfortunately the condition for testing that is wrong causing to > decide a event being spurious if no rx OR no tx requests are > pending. > > Fix that plus using local variables for rx/tx pending indicators in > order to split function calls and if condition. > > Fixes: 23025393dbeb3b ("xen/netback: use lateeoi irq binding") > Signed-off-by: Juergen Gross Reviewed-by: Wei Liu
[PATCH v2 1/3] vdpa/mlx5: should exclude header length and fcs from mtu
When feature VIRTIO_NET_F_MTU is negotiated on mlx5_vdpa, 22 extra bytes worth of MTU length is shown in guest. This is because the mlx5_query_port_max_mtu API returns the "hardware" MTU value, which does not just contain the Ethernet payload, but includes extra lengths starting from the Ethernet header up to the FCS altogether. Fix the MTU so packets won't get dropped silently. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Si-Wei Liu Acked-by: Jason Wang Acked-by: Eli Cohen --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 4 drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 ++- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index 08f742f..b6cc53b 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -4,9 +4,13 @@ #ifndef __MLX5_VDPA_H__ #define __MLX5_VDPA_H__ +#include +#include #include #include +#define MLX5V_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN) + struct mlx5_vdpa_direct_mr { u64 start; u64 end; diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index dc88559..b8416c4 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1907,6 +1907,19 @@ static int mlx5_get_vq_irq(struct vdpa_device *vdv, u16 idx) .free = mlx5_vdpa_free, }; +static int query_mtu(struct mlx5_core_dev *mdev, u16 *mtu) +{ + u16 hw_mtu; + int err; + + err = mlx5_query_nic_vport_mtu(mdev, &hw_mtu); + if (err) + return err; + + *mtu = hw_mtu - MLX5V_ETH_HARD_MTU; + return 0; +} + static int alloc_resources(struct mlx5_vdpa_net *ndev) { struct mlx5_vdpa_net_resources *res = &ndev->res; @@ -1992,7 +2005,7 @@ static int mlx5v_probe(struct auxiliary_device *adev, init_mvqs(ndev); mutex_init(&ndev->reslock); config = &ndev->config; - err = mlx5_query_nic_vport_mtu(mdev, &ndev->mtu); + err = query_mtu(mdev, &ndev->mtu); if (err) goto err_mtu; -- 1.8.3.1
[PATCH v2 3/3] vdpa/mlx5: defer clear_virtqueues to until DRIVER_OK
While virtq is stopped, get_vq_state() is supposed to be called to get sync'ed with the latest internal avail_index from device. The saved avail_index is used to restate the virtq once device is started. Commit b35ccebe3ef7 introduced the clear_virtqueues() routine to reset the saved avail_index, however, the index gets cleared a bit earlier before get_vq_state() tries to read it. This would cause consistency problems when virtq is restarted, e.g. through a series of link down and link up events. We could defer the clearing of avail_index to until the device is to be started, i.e. until VIRTIO_CONFIG_S_DRIVER_OK is set again in set_status(). Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change map") Signed-off-by: Si-Wei Liu Acked-by: Jason Wang --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 7c1f789..ce6aae8 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1777,7 +1777,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); - clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ++mvdev->generation; @@ -1786,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) { if (status & VIRTIO_CONFIG_S_DRIVER_OK) { + clear_virtqueues(ndev); err = setup_driver(ndev); if (err) { mlx5_vdpa_warn(mvdev, "failed to setup driver\n"); -- 1.8.3.1
[PATCH v2 0/3] mlx5_vdpa bug fixes
This set attempts to fix a few independent issues recently found in mlx5_vdpa driver. Please find details for each in the commit message. Patch 1 and patch 3 are already Ack'ed by Jason Wang. Patch 2 is reworked to move virtio feature capability query to mlx5v_probe() as suggested by Eli. -- v1->v2: move feature capability query to probing (Eli) Si-Wei Liu (3): vdpa/mlx5: should exclude header length and fcs from mtu vdpa/mlx5: fix feature negotiation across device reset vdpa/mlx5: defer clear_virtqueues to until DRIVER_OK drivers/vdpa/mlx5/core/mlx5_vdpa.h | 4 drivers/vdpa/mlx5/net/mlx5_vnet.c | 42 +++--- 2 files changed, 34 insertions(+), 12 deletions(-) -- 1.8.3.1
[PATCH v2 2/3] vdpa/mlx5: fix feature negotiation across device reset
The mlx_features denotes the capability for which set of virtio features is supported by device. In principle, this field needs not be cleared during virtio device reset, as this capability is static and does not change across reset. In fact, the current code may have the assumption that mlx_features can be reloaded from firmware via the .get_features ops after device is reset (via the .set_status ops), which is unfortunately not true. The userspace VMM might save a copy of backend capable features and won't call into kernel again to get it on reset. This causes all virtio features getting disabled on newly created virtqs after device reset, while guest would hold mismatched view of available features. For e.g., the guest may still assume tx checksum offload is available after reset and feature negotiation, causing frames with bogus (incomplete) checksum transmitted on the wire. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 25 +++-- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index b8416c4..7c1f789 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1486,16 +1486,8 @@ static u64 mlx_to_vritio_features(u16 dev_features) static u64 mlx5_vdpa_get_features(struct vdpa_device *vdev) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); - struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); - u16 dev_features; - dev_features = MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, device_features_bits_mask); - ndev->mvdev.mlx_features = mlx_to_vritio_features(dev_features); - if (MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, virtio_version_1_0)) - ndev->mvdev.mlx_features |= BIT_ULL(VIRTIO_F_VERSION_1); - ndev->mvdev.mlx_features |= BIT_ULL(VIRTIO_F_ACCESS_PLATFORM); - print_features(mvdev, ndev->mvdev.mlx_features, false); - return ndev->mvdev.mlx_features; + return mvdev->mlx_features; } static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features) @@ -1788,7 +1780,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; - ndev->mvdev.mlx_features = 0; ++mvdev->generation; return; } @@ -1907,6 +1898,19 @@ static int mlx5_get_vq_irq(struct vdpa_device *vdv, u16 idx) .free = mlx5_vdpa_free, }; +static void query_virtio_features(struct mlx5_vdpa_net *ndev) +{ + struct mlx5_vdpa_dev *mvdev = &ndev->mvdev; + u16 dev_features; + + dev_features = MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, device_features_bits_mask); + mvdev->mlx_features = mlx_to_vritio_features(dev_features); + if (MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, virtio_version_1_0)) + mvdev->mlx_features |= BIT_ULL(VIRTIO_F_VERSION_1); + mvdev->mlx_features |= BIT_ULL(VIRTIO_F_ACCESS_PLATFORM); + print_features(mvdev, mvdev->mlx_features, false); +} + static int query_mtu(struct mlx5_core_dev *mdev, u16 *mtu) { u16 hw_mtu; @@ -2005,6 +2009,7 @@ static int mlx5v_probe(struct auxiliary_device *adev, init_mvqs(ndev); mutex_init(&ndev->reslock); config = &ndev->config; + query_virtio_features(ndev); err = query_mtu(mdev, &ndev->mtu); if (err) goto err_mtu; -- 1.8.3.1
Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map
On 2/9/2021 7:53 PM, Jason Wang wrote: On 2021/2/10 上午10:30, Si-Wei Liu wrote: On 2/8/2021 10:37 PM, Jason Wang wrote: On 2021/2/9 下午2:12, Eli Cohen wrote: On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote: On 2021/2/8 下午6:04, Eli Cohen wrote: On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote: On 2021/2/8 下午2:37, Eli Cohen wrote: On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote: On 2021/2/6 上午7:07, Si-Wei Liu wrote: On 2/3/2021 11:36 PM, Eli Cohen wrote: When a change of memory map occurs, the hardware resources are destroyed and then re-created again with the new memory map. In such case, we need to restore the hardware available and used indices. The driver failed to restore the used index which is added here. Also, since the driver also fails to reset the available and used indices upon device reset, fix this here to avoid regression caused by the fact that used index may not be zero upon device reset. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen --- v0 -> v1: Clear indices upon device reset drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 88dde3455bfd..b5fe6d2ad22f 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info { u64 device_addr; u64 driver_addr; u16 avail_index; + u16 used_index; bool ready; struct vdpa_callback cb; bool restore; @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue { u32 virtq_id; struct mlx5_vdpa_net *ndev; u16 avail_idx; + u16 used_idx; int fw_state; /* keep last in the struct */ @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtque obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context); MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, mvq->avail_idx); + MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, mvq->used_idx); MLX5_SET(virtio_net_q_object, obj_context, queue_feature_bit_mask_12_3, get_features_12_3(ndev->mvdev.actual_features)); vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, virtio_q_context); @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m struct mlx5_virtq_attr { u8 state; u16 available_index; + u16 used_index; }; static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq, @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu memset(attr, 0, sizeof(*attr)); attr->state = MLX5_GET(virtio_net_q_object, obj_context, state); attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, hw_available_index); + attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, hw_used_index); kfree(out); return 0; @@ -1535,6 +1540,16 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } +static void clear_virtqueues(struct mlx5_vdpa_net *ndev) +{ + int i; + + for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { + ndev->vqs[i].avail_idx = 0; + ndev->vqs[i].used_idx = 0; + } +} + /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1610,6 +1625,7 @@ static int save_channel_info(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu return err; ri->avail_index = attr.available_index; + ri->used_index = attr.used_index; ri->ready = mvq->ready; ri->num_ent = mvq->num_ent; ri->desc_addr = mvq->desc_addr; @@ -1654,6 +1670,7 @@ static void restore_channels_info(struct mlx5_vdpa_net *ndev) continue; mvq->avail_idx = ri->avail_index; + mvq->used_idx = ri->used_index; mvq->ready = ri->ready; mvq->num_ent = ri->num_ent; mvq->desc_addr = ri->desc_addr; @@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); + clear_virtqueues(ndev); The clearing looks fine at the first glance, as it aligns with the other state cleanups floating around at the same place. However, the thing is get_vq_state() is supposed to be called right after to get sync'ed with the latest internal avail_index from device while vq is stopped. The index was saved in the driver software a
Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map
On 2/8/2021 10:37 PM, Jason Wang wrote: On 2021/2/9 下午2:12, Eli Cohen wrote: On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote: On 2021/2/8 下午6:04, Eli Cohen wrote: On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote: On 2021/2/8 下午2:37, Eli Cohen wrote: On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote: On 2021/2/6 上午7:07, Si-Wei Liu wrote: On 2/3/2021 11:36 PM, Eli Cohen wrote: When a change of memory map occurs, the hardware resources are destroyed and then re-created again with the new memory map. In such case, we need to restore the hardware available and used indices. The driver failed to restore the used index which is added here. Also, since the driver also fails to reset the available and used indices upon device reset, fix this here to avoid regression caused by the fact that used index may not be zero upon device reset. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen --- v0 -> v1: Clear indices upon device reset drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 88dde3455bfd..b5fe6d2ad22f 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info { u64 device_addr; u64 driver_addr; u16 avail_index; + u16 used_index; bool ready; struct vdpa_callback cb; bool restore; @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue { u32 virtq_id; struct mlx5_vdpa_net *ndev; u16 avail_idx; + u16 used_idx; int fw_state; /* keep last in the struct */ @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtque obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context); MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, mvq->avail_idx); + MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, mvq->used_idx); MLX5_SET(virtio_net_q_object, obj_context, queue_feature_bit_mask_12_3, get_features_12_3(ndev->mvdev.actual_features)); vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, virtio_q_context); @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m struct mlx5_virtq_attr { u8 state; u16 available_index; + u16 used_index; }; static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq, @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu memset(attr, 0, sizeof(*attr)); attr->state = MLX5_GET(virtio_net_q_object, obj_context, state); attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, hw_available_index); + attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, hw_used_index); kfree(out); return 0; @@ -1535,6 +1540,16 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } +static void clear_virtqueues(struct mlx5_vdpa_net *ndev) +{ + int i; + + for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { + ndev->vqs[i].avail_idx = 0; + ndev->vqs[i].used_idx = 0; + } +} + /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1610,6 +1625,7 @@ static int save_channel_info(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu return err; ri->avail_index = attr.available_index; + ri->used_index = attr.used_index; ri->ready = mvq->ready; ri->num_ent = mvq->num_ent; ri->desc_addr = mvq->desc_addr; @@ -1654,6 +1670,7 @@ static void restore_channels_info(struct mlx5_vdpa_net *ndev) continue; mvq->avail_idx = ri->avail_index; + mvq->used_idx = ri->used_index; mvq->ready = ri->ready; mvq->num_ent = ri->num_ent; mvq->desc_addr = ri->desc_addr; @@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); + clear_virtqueues(ndev); The clearing looks fine at the first glance, as it aligns with the other state cleanups floating around at the same place. However, the thing is get_vq_state() is supposed to be called right after to get sync'ed with the latest internal avail_index from device while vq is stopped. The index was saved in the driver software at vq suspension, but before the virtq object is destroyed. We shouldn't clear the avai
Re: [PATCH 3/3] mlx5_vdpa: defer clear_virtqueues to until DRIVER_OK
On 2/8/2021 7:37 PM, Jason Wang wrote: On 2021/2/6 下午8:29, Si-Wei Liu wrote: While virtq is stopped, get_vq_state() is supposed to be called to get sync'ed with the latest internal avail_index from device. The saved avail_index is used to restate the virtq once device is started. Commit b35ccebe3ef7 introduced the clear_virtqueues() routine to reset the saved avail_index, however, the index gets cleared a bit earlier before get_vq_state() tries to read it. This would cause consistency problems when virtq is restarted, e.g. through a series of link down and link up events. We could defer the clearing of avail_index to until the device is to be started, i.e. until VIRTIO_CONFIG_S_DRIVER_OK is set again in set_status(). Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change map") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index aa6f8cd..444ab58 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1785,7 +1785,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); - clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ++mvdev->generation; @@ -1794,6 +1793,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) { if (status & VIRTIO_CONFIG_S_DRIVER_OK) { + clear_virtqueues(ndev); Rethink about this. As mentioned in another thread, this in fact breaks set_vq_state(). (See vhost_virtqueue_start() -> vhost_vdpa_set_vring_base() in qemu codes). I assume that the clearing for vhost-vdpa would be done via (qemu code), vhost_dev_start()->vhost_vdpa_dev_start()->vhost_vdpa_call(status | VIRTIO_CONFIG_S_DRIVER_OK) which is _after_ vhost_virtqueue_start() gets called to restore the avail_idx to h/w in vhost_dev_start(). What am I missing here? -Siwei The issue is that the avail idx is forgot, we need keep it. Thanks err = setup_driver(ndev); if (err) { mlx5_vdpa_warn(mvdev, "failed to setup driver\n");
Re: [PATCH 4/7] xen/events: link interdomain events to associated xenbus device
On Sat, Feb 06, 2021 at 11:49:29AM +0100, Juergen Gross wrote: > In order to support the possibility of per-device event channel > settings (e.g. lateeoi spurious event thresholds) add a xenbus device > pointer to struct irq_info() and modify the related event channel > binding interfaces to take the pointer to the xenbus device as a > parameter instead of the domain id of the other side. > > While at it remove the stale prototype of bind_evtchn_to_irq_lateeoi(). > > Signed-off-by: Juergen Gross Reviewed-by: Wei Liu
Re: [PATCH 3/3] mlx5_vdpa: defer clear_virtqueues to until DRIVER_OK
On 2/7/2021 9:48 PM, Eli Cohen wrote: On Sat, Feb 06, 2021 at 04:29:24AM -0800, Si-Wei Liu wrote: While virtq is stopped, get_vq_state() is supposed to be called to get sync'ed with the latest internal avail_index from device. The saved avail_index is used to restate the virtq once device is started. Commit b35ccebe3ef7 introduced the clear_virtqueues() routine to reset the saved avail_index, however, the index gets cleared a bit earlier before get_vq_state() tries to read it. This would cause consistency problems when virtq is restarted, e.g. through a series of link down and link up events. We could defer the clearing of avail_index to until the device is to be started, i.e. until VIRTIO_CONFIG_S_DRIVER_OK is set again in set_status(). Not sure I understand the scenario. You are talking about reset of the device followed by up/down events on the interface. How can you trigger this? Currently it's not possible to trigger link up/down events with upstream QEMU due lack of config/control interrupt implementation. And live migration could be another scenario that cannot be satisfied because of inconsistent queue state. They share the same root of cause as captured here. -Siwei Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change map") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index aa6f8cd..444ab58 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1785,7 +1785,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); - clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ++mvdev->generation; @@ -1794,6 +1793,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) { if (status & VIRTIO_CONFIG_S_DRIVER_OK) { + clear_virtqueues(ndev); err = setup_driver(ndev); if (err) { mlx5_vdpa_warn(mvdev, "failed to setup driver\n"); -- 1.8.3.1
Re: [PATCH 2/3] mlx5_vdpa: fix feature negotiation across device reset
On 2/7/2021 9:35 PM, Eli Cohen wrote: On Sat, Feb 06, 2021 at 04:29:23AM -0800, Si-Wei Liu wrote: The mlx_features denotes the capability for which set of virtio features is supported by device. In principle, this field needs not be cleared during virtio device reset, as this capability is static and does not change across reset. In fact, the current code may have the assumption that mlx_features can be reloaded from firmware via the .get_features ops after device is reset (via the .set_status ops), which is unfortunately not true. The userspace VMM might save a copy of backend capable features and won't call into kernel again to get it on reset. This causes all virtio features getting disabled on newly created virtqs after device reset, while guest would hold mismatched view of available features. For e.g., the guest may still assume tx checksum offload is available after reset and feature negotiation, causing frames with bogus (incomplete) checksum transmitted on the wire. Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index b8416c4..aa6f8cd 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1788,7 +1788,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; - ndev->mvdev.mlx_features = 0; ++mvdev->generation; return; } Since we assume that device capabilities don't change, I think I would get the features through a call done in mlx5v_probe after the netdev object is created and change mlx5_vdpa_get_features() to just return ndev->mvdev.mlx_features. Yep, it makes sense. Will post a revised patch. If vdpa tool allows reconfiguration post probing, the code has to be reconciled then. Did you actually see this issue in action? If you did, can you share with us how you trigerred this? Issue is indeed seen in action. The mismatched tx-checksum offload as described in the commit message was one of such examples. You would need a guest reboot though (triggering device reset via the .set_status ops and zero'ed mlx_features was loaded to deduce new actual_features for creating the h/w virtq object) for repro. -Siwei -- 1.8.3.1
[PATCH 1/3] mlx5_vdpa: should exclude header length and fcs from mtu
When feature VIRTIO_NET_F_MTU is negotiated on mlx5_vdpa, 22 extra bytes worth of MTU length is shown in guest. This is because the mlx5_query_port_max_mtu API returns the "hardware" MTU value, which does not just contain the Ethernet payload, but includes extra lengths starting from the Ethernet header up to the FCS altogether. Fix the MTU so packets won't get dropped silently. Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 4 drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 ++- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index 08f742f..b6cc53b 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -4,9 +4,13 @@ #ifndef __MLX5_VDPA_H__ #define __MLX5_VDPA_H__ +#include +#include #include #include +#define MLX5V_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN) + struct mlx5_vdpa_direct_mr { u64 start; u64 end; diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index dc88559..b8416c4 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1907,6 +1907,19 @@ static int mlx5_get_vq_irq(struct vdpa_device *vdv, u16 idx) .free = mlx5_vdpa_free, }; +static int query_mtu(struct mlx5_core_dev *mdev, u16 *mtu) +{ + u16 hw_mtu; + int err; + + err = mlx5_query_nic_vport_mtu(mdev, &hw_mtu); + if (err) + return err; + + *mtu = hw_mtu - MLX5V_ETH_HARD_MTU; + return 0; +} + static int alloc_resources(struct mlx5_vdpa_net *ndev) { struct mlx5_vdpa_net_resources *res = &ndev->res; @@ -1992,7 +2005,7 @@ static int mlx5v_probe(struct auxiliary_device *adev, init_mvqs(ndev); mutex_init(&ndev->reslock); config = &ndev->config; - err = mlx5_query_nic_vport_mtu(mdev, &ndev->mtu); + err = query_mtu(mdev, &ndev->mtu); if (err) goto err_mtu; -- 1.8.3.1
[PATCH 2/3] mlx5_vdpa: fix feature negotiation across device reset
The mlx_features denotes the capability for which set of virtio features is supported by device. In principle, this field needs not be cleared during virtio device reset, as this capability is static and does not change across reset. In fact, the current code may have the assumption that mlx_features can be reloaded from firmware via the .get_features ops after device is reset (via the .set_status ops), which is unfortunately not true. The userspace VMM might save a copy of backend capable features and won't call into kernel again to get it on reset. This causes all virtio features getting disabled on newly created virtqs after device reset, while guest would hold mismatched view of available features. For e.g., the guest may still assume tx checksum offload is available after reset and feature negotiation, causing frames with bogus (incomplete) checksum transmitted on the wire. Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index b8416c4..aa6f8cd 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1788,7 +1788,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; - ndev->mvdev.mlx_features = 0; ++mvdev->generation; return; } -- 1.8.3.1
[PATCH 3/3] mlx5_vdpa: defer clear_virtqueues to until DRIVER_OK
While virtq is stopped, get_vq_state() is supposed to be called to get sync'ed with the latest internal avail_index from device. The saved avail_index is used to restate the virtq once device is started. Commit b35ccebe3ef7 introduced the clear_virtqueues() routine to reset the saved avail_index, however, the index gets cleared a bit earlier before get_vq_state() tries to read it. This would cause consistency problems when virtq is restarted, e.g. through a series of link down and link up events. We could defer the clearing of avail_index to until the device is to be started, i.e. until VIRTIO_CONFIG_S_DRIVER_OK is set again in set_status(). Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change map") Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index aa6f8cd..444ab58 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1785,7 +1785,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); - clear_virtqueues(ndev); mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ++mvdev->generation; @@ -1794,6 +1793,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) { if (status & VIRTIO_CONFIG_S_DRIVER_OK) { + clear_virtqueues(ndev); err = setup_driver(ndev); if (err) { mlx5_vdpa_warn(mvdev, "failed to setup driver\n"); -- 1.8.3.1
Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map
On 2/3/2021 11:36 PM, Eli Cohen wrote: When a change of memory map occurs, the hardware resources are destroyed and then re-created again with the new memory map. In such case, we need to restore the hardware available and used indices. The driver failed to restore the used index which is added here. Also, since the driver also fails to reset the available and used indices upon device reset, fix this here to avoid regression caused by the fact that used index may not be zero upon device reset. Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") Signed-off-by: Eli Cohen --- v0 -> v1: Clear indices upon device reset drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 88dde3455bfd..b5fe6d2ad22f 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info { u64 device_addr; u64 driver_addr; u16 avail_index; + u16 used_index; bool ready; struct vdpa_callback cb; bool restore; @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue { u32 virtq_id; struct mlx5_vdpa_net *ndev; u16 avail_idx; + u16 used_idx; int fw_state; /* keep last in the struct */ @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtque obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context); MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, mvq->avail_idx); + MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, mvq->used_idx); MLX5_SET(virtio_net_q_object, obj_context, queue_feature_bit_mask_12_3, get_features_12_3(ndev->mvdev.actual_features)); vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, virtio_q_context); @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m struct mlx5_virtq_attr { u8 state; u16 available_index; + u16 used_index; }; static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq, @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu memset(attr, 0, sizeof(*attr)); attr->state = MLX5_GET(virtio_net_q_object, obj_context, state); attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, hw_available_index); + attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, hw_used_index); kfree(out); return 0; @@ -1535,6 +1540,16 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev) } } +static void clear_virtqueues(struct mlx5_vdpa_net *ndev) +{ + int i; + + for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) { + ndev->vqs[i].avail_idx = 0; + ndev->vqs[i].used_idx = 0; + } +} + /* TODO: cross-endian support */ static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev) { @@ -1610,6 +1625,7 @@ static int save_channel_info(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu return err; ri->avail_index = attr.available_index; + ri->used_index = attr.used_index; ri->ready = mvq->ready; ri->num_ent = mvq->num_ent; ri->desc_addr = mvq->desc_addr; @@ -1654,6 +1670,7 @@ static void restore_channels_info(struct mlx5_vdpa_net *ndev) continue; mvq->avail_idx = ri->avail_index; + mvq->used_idx = ri->used_index; mvq->ready = ri->ready; mvq->num_ent = ri->num_ent; mvq->desc_addr = ri->desc_addr; @@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device *vdev, u8 status) if (!status) { mlx5_vdpa_info(mvdev, "performing device reset\n"); teardown_driver(ndev); + clear_virtqueues(ndev); The clearing looks fine at the first glance, as it aligns with the other state cleanups floating around at the same place. However, the thing is get_vq_state() is supposed to be called right after to get sync'ed with the latest internal avail_index from device while vq is stopped. The index was saved in the driver software at vq suspension, but before the virtq object is destroyed. We shouldn't clear the avail_index too early. Possibly it can be postponed to where VIRTIO_CONFIG_S_DRIVER_OK gets set again, i.e. right before the setup_driver() in mlx5_vdpa_set_status()? -Siwei mlx5_vdpa_destroy_mr(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.mlx_features = 0;
Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue
On Tue, Feb 2, 2021 at 9:16 PM Jason Wang wrote: > > > On 2021/2/3 上午1:54, Si-Wei Liu wrote: > > On Tue, Feb 2, 2021 at 1:23 AM Eli Cohen wrote: > >> On Tue, Feb 02, 2021 at 12:38:51AM -0800, Si-Wei Liu wrote: > >>> Thanks Eli and Jason for clarifications. See inline. > >>> > >>> On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen wrote: > >>>> On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang wrote: > >>>>> On 2021/2/2 下午12:15, Si-Wei Liu wrote: > >>>>>> On Mon, Feb 1, 2021 at 7:13 PM Jason Wang wrote: > >>>>>>> On 2021/2/2 上午3:17, Si-Wei Liu wrote: > >>>>>>>> On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu > >>>>>>>> wrote: > >>>>>>>>> On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen wrote: > >>>>>>>>>> suspend_vq should only suspend the VQ on not save the current > >>>>>>>>>> available > >>>>>>>>>> index. This is done when a change of map occurs when the driver > >>>>>>>>>> calls > >>>>>>>>>> save_channel_info(). > >>>>>>>>> Hmmm, suspend_vq() is also called by teardown_vq(), the latter of > >>>>>>>>> which doesn't save the available index as save_channel_info() > >>>>>>>>> doesn't > >>>>>>>>> get called in that path at all. How does it handle the case that > >>>>>>>>> aget_vq_state() is called from userspace (e.g. QEMU) while the > >>>>>>>>> hardware VQ object was torn down, but userspace still wants to > >>>>>>>>> access > >>>>>>>>> the queue index? > >>>>>>>>> > >>>>>>>>> Refer to > >>>>>>>>> https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/ > >>>>>>>>> > >>>>>>>>> vhost VQ 0 ring restore failed: -1: Resource temporarily > >>>>>>>>> unavailable (11) > >>>>>>>>> vhost VQ 1 ring restore failed: -1: Resource temporarily > >>>>>>>>> unavailable (11) > >>>>>>>>> > >>>>>>>>> QEMU will complain with the above warning while VM is being rebooted > >>>>>>>>> or shut down. > >>>>>>>>> > >>>>>>>>> Looks to me either the kernel driver should cover this requirement, > >>>>>>>>> or > >>>>>>>>> the userspace has to bear the burden in saving the index and not > >>>>>>>>> call > >>>>>>>>> into kernel if VQ is destroyed. > >>>>>>>> Actually, the userspace doesn't have the insights whether virt queue > >>>>>>>> will be destroyed if just changing the device status via > >>>>>>>> set_status(). > >>>>>>>> Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave > >>>>>>>> like > >>>>>>>> so. Hence this still looks to me to be Mellanox specifics and > >>>>>>>> mlx5_vdpa implementation detail that shouldn't expose to userspace. > >>>>>>> So I think we can simply drop this patch? > >>>>>> Yep, I think so. To be honest I don't know why it has anything to do > >>>>>> with the memory hotplug issue. > >>>>> > >>>>> Eli may know more, my understanding is that, during memory hotplut, qemu > >>>>> need to propagate new memory mappings via set_map(). For mellanox, it > >>>>> means > >>>>> it needs to rebuild memory keys, so the virtqueue needs to be suspended. > >>>>> > >>>> I think Siwei was asking why the first patch was related to the hotplug > >>>> issue. > >>> I was thinking how consistency is assured when saving/restoring this > >>> h/w avail_index against the one in the virtq memory, particularly in > >>> the region_add/.region_del() context (e.g. the hotplug case). Problem > >>> is I don't see explicit memory barrier when guest thread updates the > >>> avail_index, how does the devic
Re: [PATCH] vdpa/mlx5: Restore the hardware used index after change map
On Tue, Feb 2, 2021 at 10:48 PM Eli Cohen wrote: > > On Tue, Feb 02, 2021 at 09:14:02AM -0800, Si-Wei Liu wrote: > > On Tue, Feb 2, 2021 at 6:34 AM Eli Cohen wrote: > > > > > > When a change of memory map occurs, the hardware resources are destroyed > > > and then re-created again with the new memory map. In such case, we need > > > to restore the hardware available and used indices. The driver failed to > > > restore the used index which is added here. > > > > > > Fixes 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 > > > devices") > > > Signed-off-by: Eli Cohen > > > --- > > > This patch is being sent again a single patch the fixes hot memory > > > addtion to a qemy process. > > > > > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 +++ > > > 1 file changed, 7 insertions(+) > > > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > index 88dde3455bfd..839f57c64a6f 100644 > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > > @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info { > > > u64 device_addr; > > > u64 driver_addr; > > > u16 avail_index; > > > + u16 used_index; > > > bool ready; > > > struct vdpa_callback cb; > > > bool restore; > > > @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue { > > > u32 virtq_id; > > > struct mlx5_vdpa_net *ndev; > > > u16 avail_idx; > > > + u16 used_idx; > > > int fw_state; > > > > > > /* keep last in the struct */ > > > @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net > > > *ndev, struct mlx5_vdpa_virtque > > > > > > obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, > > > obj_context); > > > MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, > > > mvq->avail_idx); > > > + MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, > > > mvq->used_idx); > > > > The saved indexes will apply to the new virtqueue object whenever it > > is created. In virtio spec, these indexes will reset back to zero when > > the virtio device is reset. But I don't see how it's done today. IOW, > > I don't see where avail_idx and used_idx get cleared from the mvq for > > device reset via set_status(). > > > > Right, but this is not strictly related to this patch. I will post > another patch to fix this. Better to post these two patches in a series.Or else it may cause VM reboot problem as that is where the device gets reset. The avail_index did not as the correct value will be written to by driver right after, but used_idx introduced by this patch is supplied by device hence this patch alone would introduce regression. > > BTW, can you describe a secnario that would cause a reset (through > calling set_status()) that happens after the VQ has been used? You can try reboot the guest, that'll be the easy way to test. -Siwei > > > -Siwei > > > > > > > MLX5_SET(virtio_net_q_object, obj_context, > > > queue_feature_bit_mask_12_3, > > > get_features_12_3(ndev->mvdev.actual_features)); > > > vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, > > > virtio_q_context); > > > @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, > > > struct mlx5_vdpa_virtqueue *m > > > struct mlx5_virtq_attr { > > > u8 state; > > > u16 available_index; > > > + u16 used_index; > > > }; > > > > > > static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct > > > mlx5_vdpa_virtqueue *mvq, > > > @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net > > > *ndev, struct mlx5_vdpa_virtqueu > > > memset(attr, 0, sizeof(*attr)); > > > attr->state = MLX5_GET(virtio_net_q_object, obj_context, state); > > > attr->available_index = MLX5_GET(virtio_net_q_object, > > > obj_context, hw_available_index); > > > + attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, > > > hw_used_index); > > > kfree(out); > > > return 0; > > > > > > @@ -1610,6 +1615,7 @@ static int save_channel_info(struct mlx5_vdpa_net > > > *ndev, struct mlx5_vdpa_virtqu > > > return err; > > > > > > ri->avail_index = attr.available_index; > > > + ri->used_index = attr.used_index; > > > ri->ready = mvq->ready; > > > ri->num_ent = mvq->num_ent; > > > ri->desc_addr = mvq->desc_addr; > > > @@ -1654,6 +1660,7 @@ static void restore_channels_info(struct > > > mlx5_vdpa_net *ndev) > > > continue; > > > > > > mvq->avail_idx = ri->avail_index; > > > + mvq->used_idx = ri->used_index; > > > mvq->ready = ri->ready; > > > mvq->num_ent = ri->num_ent; > > > mvq->desc_addr = ri->desc_addr; > > > -- > > > 2.29.2 > > >
Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue
On Tue, Feb 2, 2021 at 1:23 AM Eli Cohen wrote: > > On Tue, Feb 02, 2021 at 12:38:51AM -0800, Si-Wei Liu wrote: > > Thanks Eli and Jason for clarifications. See inline. > > > > On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen wrote: > > > > > > On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang wrote: > > > > > > > > On 2021/2/2 下午12:15, Si-Wei Liu wrote: > > > > > On Mon, Feb 1, 2021 at 7:13 PM Jason Wang wrote: > > > > > > > > > > > > On 2021/2/2 上午3:17, Si-Wei Liu wrote: > > > > > > > On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu > > > > > > > wrote: > > > > > > > > On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen > > > > > > > > wrote: > > > > > > > > > suspend_vq should only suspend the VQ on not save the current > > > > > > > > > available > > > > > > > > > index. This is done when a change of map occurs when the > > > > > > > > > driver calls > > > > > > > > > save_channel_info(). > > > > > > > > Hmmm, suspend_vq() is also called by teardown_vq(), the latter > > > > > > > > of > > > > > > > > which doesn't save the available index as save_channel_info() > > > > > > > > doesn't > > > > > > > > get called in that path at all. How does it handle the case that > > > > > > > > aget_vq_state() is called from userspace (e.g. QEMU) while the > > > > > > > > hardware VQ object was torn down, but userspace still wants to > > > > > > > > access > > > > > > > > the queue index? > > > > > > > > > > > > > > > > Refer to > > > > > > > > https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/ > > > > > > > > > > > > > > > > vhost VQ 0 ring restore failed: -1: Resource temporarily > > > > > > > > unavailable (11) > > > > > > > > vhost VQ 1 ring restore failed: -1: Resource temporarily > > > > > > > > unavailable (11) > > > > > > > > > > > > > > > > QEMU will complain with the above warning while VM is being > > > > > > > > rebooted > > > > > > > > or shut down. > > > > > > > > > > > > > > > > Looks to me either the kernel driver should cover this > > > > > > > > requirement, or > > > > > > > > the userspace has to bear the burden in saving the index and > > > > > > > > not call > > > > > > > > into kernel if VQ is destroyed. > > > > > > > Actually, the userspace doesn't have the insights whether virt > > > > > > > queue > > > > > > > will be destroyed if just changing the device status via > > > > > > > set_status(). > > > > > > > Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave > > > > > > > like > > > > > > > so. Hence this still looks to me to be Mellanox specifics and > > > > > > > mlx5_vdpa implementation detail that shouldn't expose to > > > > > > > userspace. > > > > > > > > > > > > So I think we can simply drop this patch? > > > > > Yep, I think so. To be honest I don't know why it has anything to do > > > > > with the memory hotplug issue. > > > > > > > > > > > > Eli may know more, my understanding is that, during memory hotplut, qemu > > > > need to propagate new memory mappings via set_map(). For mellanox, it > > > > means > > > > it needs to rebuild memory keys, so the virtqueue needs to be suspended. > > > > > > > > > > I think Siwei was asking why the first patch was related to the hotplug > > > issue. > > > > I was thinking how consistency is assured when saving/restoring this > > h/w avail_index against the one in the virtq memory, particularly in > > the region_add/.region_del() context (e.g. the hotplug case). Problem > > is I don't see explicit memory barrier when guest thread updates the > > avail_index, how does the device make sure the
Re: [PATCH] vdpa/mlx5: Restore the hardware used index after change map
On Tue, Feb 2, 2021 at 6:34 AM Eli Cohen wrote: > > When a change of memory map occurs, the hardware resources are destroyed > and then re-created again with the new memory map. In such case, we need > to restore the hardware available and used indices. The driver failed to > restore the used index which is added here. > > Fixes 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices") > Signed-off-by: Eli Cohen > --- > This patch is being sent again a single patch the fixes hot memory > addtion to a qemy process. > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 88dde3455bfd..839f57c64a6f 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info { > u64 device_addr; > u64 driver_addr; > u16 avail_index; > + u16 used_index; > bool ready; > struct vdpa_callback cb; > bool restore; > @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue { > u32 virtq_id; > struct mlx5_vdpa_net *ndev; > u16 avail_idx; > + u16 used_idx; > int fw_state; > > /* keep last in the struct */ > @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, > struct mlx5_vdpa_virtque > > obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context); > MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, > mvq->avail_idx); > + MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, > mvq->used_idx); The saved indexes will apply to the new virtqueue object whenever it is created. In virtio spec, these indexes will reset back to zero when the virtio device is reset. But I don't see how it's done today. IOW, I don't see where avail_idx and used_idx get cleared from the mvq for device reset via set_status(). -Siwei > MLX5_SET(virtio_net_q_object, obj_context, > queue_feature_bit_mask_12_3, > get_features_12_3(ndev->mvdev.actual_features)); > vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, > virtio_q_context); > @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, > struct mlx5_vdpa_virtqueue *m > struct mlx5_virtq_attr { > u8 state; > u16 available_index; > + u16 used_index; > }; > > static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct > mlx5_vdpa_virtqueue *mvq, > @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, > struct mlx5_vdpa_virtqueu > memset(attr, 0, sizeof(*attr)); > attr->state = MLX5_GET(virtio_net_q_object, obj_context, state); > attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, > hw_available_index); > + attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, > hw_used_index); > kfree(out); > return 0; > > @@ -1610,6 +1615,7 @@ static int save_channel_info(struct mlx5_vdpa_net > *ndev, struct mlx5_vdpa_virtqu > return err; > > ri->avail_index = attr.available_index; > + ri->used_index = attr.used_index; > ri->ready = mvq->ready; > ri->num_ent = mvq->num_ent; > ri->desc_addr = mvq->desc_addr; > @@ -1654,6 +1660,7 @@ static void restore_channels_info(struct mlx5_vdpa_net > *ndev) > continue; > > mvq->avail_idx = ri->avail_index; > + mvq->used_idx = ri->used_index; > mvq->ready = ri->ready; > mvq->num_ent = ri->num_ent; > mvq->desc_addr = ri->desc_addr; > -- > 2.29.2 >
Re: [PATCH] xen/netback: avoid race in xenvif_rx_ring_slots_available()
On Tue, Feb 02, 2021 at 08:09:38AM +0100, Juergen Gross wrote: > Since commit 23025393dbeb3b8b3 ("xen/netback: use lateeoi irq binding") > xenvif_rx_ring_slots_available() is no longer called only from the rx > queue kernel thread, so it needs to access the rx queue with the > associated queue held. > > Reported-by: Igor Druzhinin > Fixes: 23025393dbeb3b8b3 ("xen/netback: use lateeoi irq binding") > Cc: sta...@vger.kernel.org > Signed-off-by: Juergen Gross Acked-by: Wei Liu
Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue
Thanks Eli and Jason for clarifications. See inline. On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen wrote: > > On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang wrote: > > > > On 2021/2/2 下午12:15, Si-Wei Liu wrote: > > > On Mon, Feb 1, 2021 at 7:13 PM Jason Wang wrote: > > > > > > > > On 2021/2/2 上午3:17, Si-Wei Liu wrote: > > > > > On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu > > > > > wrote: > > > > > > On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen wrote: > > > > > > > suspend_vq should only suspend the VQ on not save the current > > > > > > > available > > > > > > > index. This is done when a change of map occurs when the driver > > > > > > > calls > > > > > > > save_channel_info(). > > > > > > Hmmm, suspend_vq() is also called by teardown_vq(), the latter of > > > > > > which doesn't save the available index as save_channel_info() > > > > > > doesn't > > > > > > get called in that path at all. How does it handle the case that > > > > > > aget_vq_state() is called from userspace (e.g. QEMU) while the > > > > > > hardware VQ object was torn down, but userspace still wants to > > > > > > access > > > > > > the queue index? > > > > > > > > > > > > Refer to > > > > > > https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/ > > > > > > > > > > > > vhost VQ 0 ring restore failed: -1: Resource temporarily > > > > > > unavailable (11) > > > > > > vhost VQ 1 ring restore failed: -1: Resource temporarily > > > > > > unavailable (11) > > > > > > > > > > > > QEMU will complain with the above warning while VM is being rebooted > > > > > > or shut down. > > > > > > > > > > > > Looks to me either the kernel driver should cover this requirement, > > > > > > or > > > > > > the userspace has to bear the burden in saving the index and not > > > > > > call > > > > > > into kernel if VQ is destroyed. > > > > > Actually, the userspace doesn't have the insights whether virt queue > > > > > will be destroyed if just changing the device status via set_status(). > > > > > Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave like > > > > > so. Hence this still looks to me to be Mellanox specifics and > > > > > mlx5_vdpa implementation detail that shouldn't expose to userspace. > > > > > > > > So I think we can simply drop this patch? > > > Yep, I think so. To be honest I don't know why it has anything to do > > > with the memory hotplug issue. > > > > > > Eli may know more, my understanding is that, during memory hotplut, qemu > > need to propagate new memory mappings via set_map(). For mellanox, it means > > it needs to rebuild memory keys, so the virtqueue needs to be suspended. > > > > I think Siwei was asking why the first patch was related to the hotplug > issue. I was thinking how consistency is assured when saving/restoring this h/w avail_index against the one in the virtq memory, particularly in the region_add/.region_del() context (e.g. the hotplug case). Problem is I don't see explicit memory barrier when guest thread updates the avail_index, how does the device make sure the h/w avail_index is uptodate while guest may race with updating the virtq's avail_index in the mean while? Maybe I completely miss something obvious? Thanks, -Siwei > > But you're correct. When memory is added, I get a new memory map. This > requires me to build a new memory key object which covers the new memory > map. Since the virtqueue objects are referencing this memory key, I need > to destroy them and build new ones referncing the new memory key. > > > Thanks > > > > > > > > > > -Siwei > > > > > > > Thanks > > > > > > > > > > > > > > -Siwei > > > > > > > > > > > > > > > > > > > Signed-off-by: Eli Cohen > > > > > > > --- > > > > > > >drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 > > > > > > >1 file changed, 8 deletions(-) > > > > > > > > > > > > > > diff --gi
Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue
On Mon, Feb 1, 2021 at 7:13 PM Jason Wang wrote: > > > On 2021/2/2 上午3:17, Si-Wei Liu wrote: > > On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu wrote: > >> On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen wrote: > >>> suspend_vq should only suspend the VQ on not save the current available > >>> index. This is done when a change of map occurs when the driver calls > >>> save_channel_info(). > >> Hmmm, suspend_vq() is also called by teardown_vq(), the latter of > >> which doesn't save the available index as save_channel_info() doesn't > >> get called in that path at all. How does it handle the case that > >> aget_vq_state() is called from userspace (e.g. QEMU) while the > >> hardware VQ object was torn down, but userspace still wants to access > >> the queue index? > >> > >> Refer to > >> https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/ > >> > >> vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11) > >> vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11) > >> > >> QEMU will complain with the above warning while VM is being rebooted > >> or shut down. > >> > >> Looks to me either the kernel driver should cover this requirement, or > >> the userspace has to bear the burden in saving the index and not call > >> into kernel if VQ is destroyed. > > Actually, the userspace doesn't have the insights whether virt queue > > will be destroyed if just changing the device status via set_status(). > > Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave like > > so. Hence this still looks to me to be Mellanox specifics and > > mlx5_vdpa implementation detail that shouldn't expose to userspace. > > > So I think we can simply drop this patch? Yep, I think so. To be honest I don't know why it has anything to do with the memory hotplug issue. -Siwei > > Thanks > > > >> -Siwei > >> > >> > >>> Signed-off-by: Eli Cohen > >>> --- > >>> drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 > >>> 1 file changed, 8 deletions(-) > >>> > >>> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > >>> b/drivers/vdpa/mlx5/net/mlx5_vnet.c > >>> index 88dde3455bfd..549ded074ff3 100644 > >>> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > >>> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > >>> @@ -1148,8 +1148,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, > >>> struct mlx5_vdpa_virtqueue *mvq) > >>> > >>> static void suspend_vq(struct mlx5_vdpa_net *ndev, struct > >>> mlx5_vdpa_virtqueue *mvq) > >>> { > >>> - struct mlx5_virtq_attr attr; > >>> - > >>> if (!mvq->initialized) > >>> return; > >>> > >>> @@ -1158,12 +1156,6 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, > >>> struct mlx5_vdpa_virtqueue *m > >>> > >>> if (modify_virtqueue(ndev, mvq, > >>> MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND)) > >>> mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend > >>> failed\n"); > >>> - > >>> - if (query_virtqueue(ndev, mvq, &attr)) { > >>> - mlx5_vdpa_warn(&ndev->mvdev, "failed to query > >>> virtqueue\n"); > >>> - return; > >>> - } > >>> - mvq->avail_idx = attr.available_index; > >>> } > >>> > >>> static void suspend_vqs(struct mlx5_vdpa_net *ndev) > >>> -- > >>> 2.29.2 > >>> >
Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue
On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu wrote: > > On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen wrote: > > > > suspend_vq should only suspend the VQ on not save the current available > > index. This is done when a change of map occurs when the driver calls > > save_channel_info(). > > Hmmm, suspend_vq() is also called by teardown_vq(), the latter of > which doesn't save the available index as save_channel_info() doesn't > get called in that path at all. How does it handle the case that > aget_vq_state() is called from userspace (e.g. QEMU) while the > hardware VQ object was torn down, but userspace still wants to access > the queue index? > > Refer to > https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/ > > vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11) > vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11) > > QEMU will complain with the above warning while VM is being rebooted > or shut down. > > Looks to me either the kernel driver should cover this requirement, or > the userspace has to bear the burden in saving the index and not call > into kernel if VQ is destroyed. Actually, the userspace doesn't have the insights whether virt queue will be destroyed if just changing the device status via set_status(). Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave like so. Hence this still looks to me to be Mellanox specifics and mlx5_vdpa implementation detail that shouldn't expose to userspace. > > -Siwei > > > > > > Signed-off-by: Eli Cohen > > --- > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 > > 1 file changed, 8 deletions(-) > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > index 88dde3455bfd..549ded074ff3 100644 > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > > @@ -1148,8 +1148,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, > > struct mlx5_vdpa_virtqueue *mvq) > > > > static void suspend_vq(struct mlx5_vdpa_net *ndev, struct > > mlx5_vdpa_virtqueue *mvq) > > { > > - struct mlx5_virtq_attr attr; > > - > > if (!mvq->initialized) > > return; > > > > @@ -1158,12 +1156,6 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, > > struct mlx5_vdpa_virtqueue *m > > > > if (modify_virtqueue(ndev, mvq, > > MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND)) > > mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n"); > > - > > - if (query_virtqueue(ndev, mvq, &attr)) { > > - mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n"); > > - return; > > - } > > - mvq->avail_idx = attr.available_index; > > } > > > > static void suspend_vqs(struct mlx5_vdpa_net *ndev) > > -- > > 2.29.2 > >
Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue
On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen wrote: > > suspend_vq should only suspend the VQ on not save the current available > index. This is done when a change of map occurs when the driver calls > save_channel_info(). Hmmm, suspend_vq() is also called by teardown_vq(), the latter of which doesn't save the available index as save_channel_info() doesn't get called in that path at all. How does it handle the case that aget_vq_state() is called from userspace (e.g. QEMU) while the hardware VQ object was torn down, but userspace still wants to access the queue index? Refer to https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/ vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11) vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11) QEMU will complain with the above warning while VM is being rebooted or shut down. Looks to me either the kernel driver should cover this requirement, or the userspace has to bear the burden in saving the index and not call into kernel if VQ is destroyed. -Siwei > > Signed-off-by: Eli Cohen > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 > 1 file changed, 8 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 88dde3455bfd..549ded074ff3 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -1148,8 +1148,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, struct > mlx5_vdpa_virtqueue *mvq) > > static void suspend_vq(struct mlx5_vdpa_net *ndev, struct > mlx5_vdpa_virtqueue *mvq) > { > - struct mlx5_virtq_attr attr; > - > if (!mvq->initialized) > return; > > @@ -1158,12 +1156,6 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, > struct mlx5_vdpa_virtqueue *m > > if (modify_virtqueue(ndev, mvq, > MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND)) > mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n"); > - > - if (query_virtqueue(ndev, mvq, &attr)) { > - mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n"); > - return; > - } > - mvq->avail_idx = attr.available_index; > } > > static void suspend_vqs(struct mlx5_vdpa_net *ndev) > -- > 2.29.2 >
Re: [PATCH v3 hyperv-next 0/4] Drivers: hv: vmbus: Restrict devices and configurations on 'isolated' guests
On Mon, Feb 01, 2021 at 03:48:10PM +0100, Andrea Parri (Microsoft) wrote: > Andrea Parri (Microsoft) (4): > x86/hyperv: Load/save the Isolation Configuration leaf > Drivers: hv: vmbus: Restrict vmbus_devices on isolated guests > Drivers: hv: vmbus: Enforce 'VMBus version >= 5.2' on isolated guests > hv_netvsc: Restrict configurations on isolated guests Applied to hyperv-next. Thanks. Wei.
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Sat, Jan 16, 2021 at 02:02:01PM +0100, Andrea Parri wrote: > On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > > For additional robustness in the face of Hyper-V errors or malicious > > > behavior, validate all values that originate from packets that Hyper-V > > > has sent to the guest. Ensure that invalid values cannot cause indexing > > > off the end of an array, or subvert an existing validation via integer > > > overflow. Ensure that outgoing packets do not have any leftover guest > > > memory that has not been zeroed out. > > > > > > Reported-by: Juan Vazquez > > > Signed-off-by: Andrea Parri (Microsoft) > > > Cc: "David S. Miller" > > > Cc: Jakub Kicinski > > > Cc: Alexei Starovoitov > > > Cc: Daniel Borkmann > > > Cc: Andrii Nakryiko > > > Cc: Martin KaFai Lau > > > Cc: Song Liu > > > Cc: Yonghong Song > > > Cc: John Fastabend > > > Cc: KP Singh > > > Cc: netdev@vger.kernel.org > > > Cc: b...@vger.kernel.org > > > --- > > > Applies to 5.11-rc3 (and hyperv-next). > > > > So this is for hyperv-next or should we take it via netdev trees? > > No preference, either way is good for me. To be clear: There is no dependency on any patch in hyperv-next, right? That's my understanding, but I would like to confirm it. Wei. > > Thanks, > Andrea
Re: [PATCH v3] Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
On Tue, Dec 22, 2020 at 01:59:34PM +, Michael Kelley wrote: > From: Andrea Parri (Microsoft) Sent: Monday, > December 7, 2020 8:53 PM > > > > From: Andres Beltran > > > > Pointers to ring-buffer packets sent by Hyper-V are used within the > > guest VM. Hyper-V can send packets with erroneous values or modify > > packet fields after they are processed by the guest. To defend > > against these scenarios, return a copy of the incoming VMBus packet > > after validating its length and offset fields in hv_pkt_iter_first(). > > In this way, the packet can no longer be modified by the host. > > > > Signed-off-by: Andres Beltran > > Co-developed-by: Andrea Parri (Microsoft) > > Signed-off-by: Andrea Parri (Microsoft) > > Cc: "David S. Miller" > > Cc: Jakub Kicinski > > Cc: "James E.J. Bottomley" > > Cc: "Martin K. Petersen" > > Cc: netdev@vger.kernel.org > > Cc: linux-s...@vger.kernel.org > > --- > > drivers/hv/channel.c | 9 ++-- > > drivers/hv/hv_fcopy.c | 1 + > > drivers/hv/hv_kvp.c | 1 + > > drivers/hv/hyperv_vmbus.h | 2 +- > > drivers/hv/ring_buffer.c | 82 ++- > > drivers/net/hyperv/hyperv_net.h | 3 ++ > > drivers/net/hyperv/netvsc.c | 2 + > > drivers/net/hyperv/rndis_filter.c | 2 + > > drivers/scsi/storvsc_drv.c| 10 > > include/linux/hyperv.h| 48 +++--- > > net/vmw_vsock/hyperv_transport.c | 4 +- > > 11 files changed, 139 insertions(+), 25 deletions(-) > > > > Reviewed-by: Michael Kelley Applied to hyperv-next.
Re: [PATCH] hv_netvsc: Validate number of allocated sub-channels
On Wed, Nov 18, 2020 at 05:37:15PM -0800, Jakub Kicinski wrote: > On Wed, 18 Nov 2020 16:33:10 +0100 Andrea Parri (Microsoft) wrote: > > Lack of validation could lead to out-of-bound reads and information > > leaks (cf. usage of nvdev->chan_table[]). Check that the number of > > allocated sub-channels fits into the expected range. > > > > Suggested-by: Saruhan Karademir > > Signed-off-by: Andrea Parri (Microsoft) > > Cc: "David S. Miller" > > Cc: Jakub Kicinski > > Cc: netdev@vger.kernel.org > > Acked-by: Jakub Kicinski Applied to hyperv-next.
Re: [PATCH 2/2] x86: make hyperv support optional
On Tue, Nov 17, 2020 at 09:23:08PM +0100, Enrico Weigelt, metux IT consult wrote: > Make it possible to opt-out from hyperv support. > "Hyper-V support". Have you tested this patch? If so, how? > Signed-off-by: Enrico Weigelt, metux IT consult > --- > arch/x86/Kconfig | 7 +++ > arch/x86/kernel/cpu/Makefile | 4 ++-- > arch/x86/kernel/cpu/hypervisor.c | 2 ++ > drivers/hv/Kconfig | 2 +- > 4 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index c227c1fa0091..60aab344d6ab 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -808,6 +808,13 @@ config VMWARE_GUEST > This option enables several optimizations for running under the > VMware hypervisor. > > +config HYPERV_GUEST > + bool "HyperV Guest support" Hyper-V here. > + default y > + help > + This option enables several optimizations for running under the > + HyperV hypervisor. > + "for running under Hyper-V". > config KVM_GUEST > bool "KVM Guest support (including kvmclock)" > depends on PARAVIRT > diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile > index a615b0152bf0..5536b801cb44 100644 > --- a/arch/x86/kernel/cpu/Makefile > +++ b/arch/x86/kernel/cpu/Makefile > @@ -51,9 +51,9 @@ obj-$(CONFIG_X86_CPU_RESCTRL) += resctrl/ > > obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o > > -obj-$(CONFIG_HYPERVISOR_GUEST) += hypervisor.o mshyperv.o > +obj-$(CONFIG_HYPERVISOR_GUEST) += hypervisor.o > obj-$(CONFIG_VMWARE_GUEST) += vmware.o > - > +obj-$(CONFIG_HYPERV_GUEST) += mshyperv.o > obj-$(CONFIG_ACRN_GUEST) += acrn.o > > ifdef CONFIG_X86_FEATURE_NAMES > diff --git a/arch/x86/kernel/cpu/hypervisor.c > b/arch/x86/kernel/cpu/hypervisor.c > index c0e770a224aa..32d6b2084d05 100644 > --- a/arch/x86/kernel/cpu/hypervisor.c > +++ b/arch/x86/kernel/cpu/hypervisor.c > @@ -37,7 +37,9 @@ static const __initconst struct hypervisor_x86 * const > hypervisors[] = > #ifdef CONFIG_VMWARE_GUEST > &x86_hyper_vmware, > #endif > +#ifdef CONFIG_HYPERV_GUEST > &x86_hyper_ms_hyperv, > +#endif > #ifdef CONFIG_KVM_GUEST > &x86_hyper_kvm, > #endif > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig > index 79e5356a737a..7b3094c59a81 100644 > --- a/drivers/hv/Kconfig > +++ b/drivers/hv/Kconfig > @@ -4,7 +4,7 @@ menu "Microsoft Hyper-V guest support" > > config HYPERV > tristate "Microsoft Hyper-V client drivers" > - depends on X86 && ACPI && X86_LOCAL_APIC && HYPERVISOR_GUEST > + depends on X86 && ACPI && X86_LOCAL_APIC && HYPERV_GUEST > select PARAVIRT > select X86_HV_CALLBACK_VECTOR > help Maybe that one should be moved to x86/Kconfig and used instead? Wei. > -- > 2.11.0 >
Re: [PATCH v9 0/3] Drivers: hv: vmbus: vmbus_requestor data structure for VMBus hardening
On Mon, Nov 09, 2020 at 11:03:59AM +0100, Andrea Parri (Microsoft) wrote: > Currently, VMbus drivers use pointers into guest memory as request IDs > for interactions with Hyper-V. To be more robust in the face of errors > or malicious behavior from a compromised Hyper-V, avoid exposing > guest memory addresses to Hyper-V. Also avoid Hyper-V giving back a > bad request ID that is then treated as the address of a guest data > structure with no validation. Instead, encapsulate these memory > addresses and provide small integers as request IDs. > > The first patch creates the definitions for the data structure, provides > helper methods to generate new IDs and retrieve data, and > allocates/frees the memory needed for vmbus_requestor. > > The second and third patches make use of vmbus_requestor to send request > IDs to Hyper-V in storvsc and netvsc respectively. > > The series is based on 5.10-rc3. Changelog in the actual patches. Applied to hyperv-next. Thanks. I also corrected the email address in my reviewed-by tags while committing -- should've use my @kernel.org address, not @xen.org. Wei.
Re: [PATCH v9 3/3] hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening
On Mon, Nov 09, 2020 at 11:04:02AM +0100, Andrea Parri (Microsoft) wrote: > From: Andres Beltran > > Currently, pointers to guest memory are passed to Hyper-V as > transaction IDs in netvsc. In the face of errors or malicious > behavior in Hyper-V, netvsc should not expose or trust the transaction > IDs returned by Hyper-V to be valid guest memory addresses. Instead, > use small integers generated by vmbus_requestor as requests > (transaction) IDs. > > Signed-off-by: Andres Beltran > Co-developed-by: Andrea Parri (Microsoft) > Signed-off-by: Andrea Parri (Microsoft) > Reviewed-by: Michael Kelley > Acked-by: Jakub Kicinski > Cc: "David S. Miller" > Cc: Jakub Kicinski > Cc: netdev@vger.kernel.org Reviewed-by: Wei Liu
Re: [PATCH 05/12] net: xen-netback: xenbus: Demote nonconformant kernel-doc headers
On Wed, Nov 04, 2020 at 09:06:03AM +, Lee Jones wrote: > Fixes the following W=1 kernel build warning(s): > > drivers/net/xen-netback/xenbus.c:419: warning: Function parameter or member > 'dev' not described in 'frontend_changed' > drivers/net/xen-netback/xenbus.c:419: warning: Function parameter or member > 'frontend_state' not described in 'frontend_changed' > drivers/net/xen-netback/xenbus.c:1001: warning: Function parameter or member > 'dev' not described in 'netback_probe' > drivers/net/xen-netback/xenbus.c:1001: warning: Function parameter or member > 'id' not described in 'netback_probe' > > Cc: Wei Liu If this is ever needed: Acked-by: Wei Liu
Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path
On 10/29/2020 2:53 PM, Michael S. Tsirkin wrote: On Thu, Oct 15, 2020 at 01:17:14PM -0700, si-wei liu wrote: On 10/15/2020 6:11 AM, Michael S. Tsirkin wrote: On Thu, Oct 15, 2020 at 02:15:32PM +0800, Jason Wang wrote: On 2020/10/14 上午7:42, si-wei liu wrote: So what I suggest is to fix the pinning leakage first and do the possible optimization on top (which is still questionable to me). OK. Unfortunately, this was picked and got merged in upstream. So I will post a follow up patch set to 1) revert the commit to the original __get_free_page() implementation, and 2) fix the accounting and leakage on top. Will it be fine? Fine. Thanks Fine by me too. Thanks, Michael & Jason. I will post the fix shortly. Stay tuned. -Siwei did I miss the patch? The patch had been posted last Friday. See this thread: https://lore.kernel.org/virtualization/1604043944-4897-2-git-send-email-si-wei@oracle.com/ -Siwei
Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path
On 10/29/2020 2:53 PM, Michael S. Tsirkin wrote: On Thu, Oct 15, 2020 at 01:17:14PM -0700, si-wei liu wrote: On 10/15/2020 6:11 AM, Michael S. Tsirkin wrote: On Thu, Oct 15, 2020 at 02:15:32PM +0800, Jason Wang wrote: On 2020/10/14 上午7:42, si-wei liu wrote: So what I suggest is to fix the pinning leakage first and do the possible optimization on top (which is still questionable to me). OK. Unfortunately, this was picked and got merged in upstream. So I will post a follow up patch set to 1) revert the commit to the original __get_free_page() implementation, and 2) fix the accounting and leakage on top. Will it be fine? Fine. Thanks Fine by me too. Thanks, Michael & Jason. I will post the fix shortly. Stay tuned. -Siwei did I miss the patch? You didn't, sorry. I was looking into a way to speed up the boot time for large memory guest by multi-threading the page pinning process, and it turns out I need more time on that. Will post the fix I have now soon, hopefully tomorrow. -Siwei
Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path
On 10/15/2020 6:11 AM, Michael S. Tsirkin wrote: On Thu, Oct 15, 2020 at 02:15:32PM +0800, Jason Wang wrote: On 2020/10/14 上午7:42, si-wei liu wrote: So what I suggest is to fix the pinning leakage first and do the possible optimization on top (which is still questionable to me). OK. Unfortunately, this was picked and got merged in upstream. So I will post a follow up patch set to 1) revert the commit to the original __get_free_page() implementation, and 2) fix the accounting and leakage on top. Will it be fine? Fine. Thanks Fine by me too. Thanks, Michael & Jason. I will post the fix shortly. Stay tuned. -Siwei
Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path
On 10/9/2020 7:27 PM, Jason Wang wrote: On 2020/10/3 下午1:02, Si-Wei Liu wrote: Pinned pages are not properly accounted particularly when mapping error occurs on IOTLB update. Clean up dangling pinned pages for the error path. As the inflight pinned pages, specifically for memory region that strides across multiple chunks, would need more than one free page for book keeping and accounting. For simplicity, pin pages for all memory in the IOVA range in one go rather than have multiple pin_user_pages calls to make up the entire region. This way it's easier to track and account the pages already mapped, particularly for clean-up in the error path. Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend") Signed-off-by: Si-Wei Liu --- Changes in v3: - Factor out vhost_vdpa_map() change to a separate patch Changes in v2: - Fix incorrect target SHA1 referenced drivers/vhost/vdpa.c | 119 ++- 1 file changed, 71 insertions(+), 48 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 0f27919..dad41dae 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -595,21 +595,19 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, struct vhost_dev *dev = &v->vdev; struct vhost_iotlb *iotlb = dev->iotlb; struct page **page_list; -unsigned long list_size = PAGE_SIZE / sizeof(struct page *); +struct vm_area_struct **vmas; unsigned int gup_flags = FOLL_LONGTERM; -unsigned long npages, cur_base, map_pfn, last_pfn = 0; -unsigned long locked, lock_limit, pinned, i; +unsigned long map_pfn, last_pfn = 0; +unsigned long npages, lock_limit; +unsigned long i, nmap = 0; u64 iova = msg->iova; +long pinned; int ret = 0; if (vhost_iotlb_itree_first(iotlb, msg->iova, msg->iova + msg->size - 1)) return -EEXIST; -page_list = (struct page **) __get_free_page(GFP_KERNEL); -if (!page_list) -return -ENOMEM; - if (msg->perm & VHOST_ACCESS_WO) gup_flags |= FOLL_WRITE; @@ -617,61 +615,86 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, if (!npages) return -EINVAL; +page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL); +vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *), + GFP_KERNEL); This will result high order memory allocation which was what the code tried to avoid originally. Using an unlimited size will cause a lot of side effects consider VM or userspace may try to pin several TB of memory. Hmmm, that's a good point. Indeed, if the guest memory demand is huge or the host system is running short of free pages, kvmalloc will be problematic and less efficient than the __get_free_page implementation. +if (!page_list || !vmas) { +ret = -ENOMEM; +goto free; +} Any reason that you want to use vmas? Without providing custom vmas, it's subject to high order allocation failure. While page_list and vmas can now fallback to virtual memory allocation if need be. + mmap_read_lock(dev->mm); -locked = atomic64_add_return(npages, &dev->mm->pinned_vm); lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - -if (locked > lock_limit) { +if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) { ret = -ENOMEM; -goto out; +goto unlock; } -cur_base = msg->uaddr & PAGE_MASK; -iova &= PAGE_MASK; +pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags, +page_list, vmas); +if (npages != pinned) { +if (pinned < 0) { +ret = pinned; +} else { +unpin_user_pages(page_list, pinned); +ret = -ENOMEM; +} +goto unlock; +} -while (npages) { -pinned = min_t(unsigned long, npages, list_size); -ret = pin_user_pages(cur_base, pinned, - gup_flags, page_list, NULL); -if (ret != pinned) -goto out; - -if (!last_pfn) -map_pfn = page_to_pfn(page_list[0]); - -for (i = 0; i < ret; i++) { -unsigned long this_pfn = page_to_pfn(page_list[i]); -u64 csize; - -if (last_pfn && (this_pfn != last_pfn + 1)) { -/* Pin a contiguous chunk of memory */ -csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT; -if (vhost_vdpa_map(v, iova, csize, - map_pfn << PAGE_SHIFT, - msg->perm)) -goto out; -map_pfn = this_pfn; -iova += csize; +iova &= PAGE_MASK; +map_pfn = page_to_pfn(page_list[0]); + +/* One more iterati
[PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path
Pinned pages are not properly accounted particularly when mapping error occurs on IOTLB update. Clean up dangling pinned pages for the error path. As the inflight pinned pages, specifically for memory region that strides across multiple chunks, would need more than one free page for book keeping and accounting. For simplicity, pin pages for all memory in the IOVA range in one go rather than have multiple pin_user_pages calls to make up the entire region. This way it's easier to track and account the pages already mapped, particularly for clean-up in the error path. Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend") Signed-off-by: Si-Wei Liu --- Changes in v3: - Factor out vhost_vdpa_map() change to a separate patch Changes in v2: - Fix incorrect target SHA1 referenced drivers/vhost/vdpa.c | 119 ++- 1 file changed, 71 insertions(+), 48 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 0f27919..dad41dae 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -595,21 +595,19 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, struct vhost_dev *dev = &v->vdev; struct vhost_iotlb *iotlb = dev->iotlb; struct page **page_list; - unsigned long list_size = PAGE_SIZE / sizeof(struct page *); + struct vm_area_struct **vmas; unsigned int gup_flags = FOLL_LONGTERM; - unsigned long npages, cur_base, map_pfn, last_pfn = 0; - unsigned long locked, lock_limit, pinned, i; + unsigned long map_pfn, last_pfn = 0; + unsigned long npages, lock_limit; + unsigned long i, nmap = 0; u64 iova = msg->iova; + long pinned; int ret = 0; if (vhost_iotlb_itree_first(iotlb, msg->iova, msg->iova + msg->size - 1)) return -EEXIST; - page_list = (struct page **) __get_free_page(GFP_KERNEL); - if (!page_list) - return -ENOMEM; - if (msg->perm & VHOST_ACCESS_WO) gup_flags |= FOLL_WRITE; @@ -617,61 +615,86 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, if (!npages) return -EINVAL; + page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL); + vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *), + GFP_KERNEL); + if (!page_list || !vmas) { + ret = -ENOMEM; + goto free; + } + mmap_read_lock(dev->mm); - locked = atomic64_add_return(npages, &dev->mm->pinned_vm); lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - - if (locked > lock_limit) { + if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) { ret = -ENOMEM; - goto out; + goto unlock; } - cur_base = msg->uaddr & PAGE_MASK; - iova &= PAGE_MASK; + pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags, + page_list, vmas); + if (npages != pinned) { + if (pinned < 0) { + ret = pinned; + } else { + unpin_user_pages(page_list, pinned); + ret = -ENOMEM; + } + goto unlock; + } - while (npages) { - pinned = min_t(unsigned long, npages, list_size); - ret = pin_user_pages(cur_base, pinned, -gup_flags, page_list, NULL); - if (ret != pinned) - goto out; - - if (!last_pfn) - map_pfn = page_to_pfn(page_list[0]); - - for (i = 0; i < ret; i++) { - unsigned long this_pfn = page_to_pfn(page_list[i]); - u64 csize; - - if (last_pfn && (this_pfn != last_pfn + 1)) { - /* Pin a contiguous chunk of memory */ - csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT; - if (vhost_vdpa_map(v, iova, csize, - map_pfn << PAGE_SHIFT, - msg->perm)) - goto out; - map_pfn = this_pfn; - iova += csize; + iova &= PAGE_MASK; + map_pfn = page_to_pfn(page_list[0]); + + /* One more iteration to avoid extra vdpa_map() call out of loop. */ + for (i = 0; i <= npages; i++) { + unsigned long this_pfn; + u64 csize; + + /* The last chunk may have no valid PFN next t
[PATCH v3 0/2] vhost-vdpa mapping error path fixes
Commit 4c8cf31885f6 ("vhost: introduce vDPA-based backend") has following issues in the failure path of IOTLB update: 1) vhost_vdpa_map() does not clean up dangling iotlb entry upon mapping failure 2) vhost_vdpa_process_iotlb_update() has leakage of pinned pages in case of vhost_vdpa_map() failure This patchset attempts to address the above issues. Changes in v3: - Factor out changes in vhost_vdpa_map() and the fix for page pinning leak to separate patches (Jason) --- Si-Wei Liu (2): vhost-vdpa: fix vhost_vdpa_map() on error condition vhost-vdpa: fix page pinning leakage in error path drivers/vhost/vdpa.c | 122 +++ 1 file changed, 74 insertions(+), 48 deletions(-) -- 1.8.3.1
[PATCH v3 1/2] vhost-vdpa: fix vhost_vdpa_map() on error condition
vhost_vdpa_map() should remove the iotlb entry just added if the corresponding mapping fails to set up properly. Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend") Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 796fe97..0f27919 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -565,6 +565,9 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, perm_to_iommu_flags(perm)); } + if (r) + vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1); + return r; } -- 1.8.3.1
Re: [PATCH] vdpa/mlx5: should keep avail_index despite device status
+ Eli. On Thu, Oct 1, 2020 at 2:02 PM Si-Wei Liu wrote: > > A VM with mlx5 vDPA has below warnings while being reset: > > vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11) > vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11) > > We should allow userspace emulating the virtio device be > able to get to vq's avail_index, regardless of vDPA device > status. Save the index that was last seen when virtq was > stopped, so that userspace doesn't complain. > > Signed-off-by: Si-Wei Liu > --- > drivers/vdpa/mlx5/net/mlx5_vnet.c | 20 ++-- > 1 file changed, 14 insertions(+), 6 deletions(-) > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c > b/drivers/vdpa/mlx5/net/mlx5_vnet.c > index 70676a6..74264e59 100644 > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c > @@ -1133,15 +1133,17 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, > struct mlx5_vdpa_virtqueue *m > if (!mvq->initialized) > return; > > - if (query_virtqueue(ndev, mvq, &attr)) { > - mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n"); > - return; > - } > if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY) > return; > > if (modify_virtqueue(ndev, mvq, > MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND)) > mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n"); > + > + if (query_virtqueue(ndev, mvq, &attr)) { > + mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n"); > + return; > + } > + mvq->avail_idx = attr.available_index; > } > > static void suspend_vqs(struct mlx5_vdpa_net *ndev) > @@ -1411,8 +1413,14 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device > *vdev, u16 idx, struct vdpa > struct mlx5_virtq_attr attr; > int err; > > - if (!mvq->initialized) > - return -EAGAIN; > + /* If the virtq object was destroyed, use the value saved at > +* the last minute of suspend_vq. This caters for userspace > +* that cares about emulating the index after vq is stopped. > +*/ > + if (!mvq->initialized) { > + state->avail_index = mvq->avail_idx; > + return 0; > + } > > err = query_virtqueue(ndev, mvq, &attr); > if (err) { > -- > 1.8.3.1 >
[PATCH v2] vhost-vdpa: fix page pinning leakage in error path
Pinned pages are not properly accounted particularly when mapping error occurs on IOTLB update. Clean up dangling pinned pages for the error path. As the inflight pinned pages, specifically for memory region that strides across multiple chunks, would need more than one free page for book keeping and accounting. For simplicity, pin pages for all memory in the IOVA range in one go rather than have multiple pin_user_pages calls to make up the entire region. This way it's easier to track and account the pages already mapped, particularly for clean-up in the error path. Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend") Signed-off-by: Si-Wei Liu --- Changes in v2: - Fix incorrect target SHA1 referenced drivers/vhost/vdpa.c | 121 +++ 1 file changed, 73 insertions(+), 48 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 796fe97..abc4aa2 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -565,6 +565,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, perm_to_iommu_flags(perm)); } + if (r) + vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1); return r; } @@ -592,21 +594,19 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, struct vhost_dev *dev = &v->vdev; struct vhost_iotlb *iotlb = dev->iotlb; struct page **page_list; - unsigned long list_size = PAGE_SIZE / sizeof(struct page *); + struct vm_area_struct **vmas; unsigned int gup_flags = FOLL_LONGTERM; - unsigned long npages, cur_base, map_pfn, last_pfn = 0; - unsigned long locked, lock_limit, pinned, i; + unsigned long map_pfn, last_pfn = 0; + unsigned long npages, lock_limit; + unsigned long i, nmap = 0; u64 iova = msg->iova; + long pinned; int ret = 0; if (vhost_iotlb_itree_first(iotlb, msg->iova, msg->iova + msg->size - 1)) return -EEXIST; - page_list = (struct page **) __get_free_page(GFP_KERNEL); - if (!page_list) - return -ENOMEM; - if (msg->perm & VHOST_ACCESS_WO) gup_flags |= FOLL_WRITE; @@ -614,61 +614,86 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, if (!npages) return -EINVAL; + page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL); + vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *), + GFP_KERNEL); + if (!page_list || !vmas) { + ret = -ENOMEM; + goto free; + } + mmap_read_lock(dev->mm); - locked = atomic64_add_return(npages, &dev->mm->pinned_vm); lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - - if (locked > lock_limit) { + if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) { ret = -ENOMEM; - goto out; + goto unlock; } - cur_base = msg->uaddr & PAGE_MASK; - iova &= PAGE_MASK; + pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags, + page_list, vmas); + if (npages != pinned) { + if (pinned < 0) { + ret = pinned; + } else { + unpin_user_pages(page_list, pinned); + ret = -ENOMEM; + } + goto unlock; + } - while (npages) { - pinned = min_t(unsigned long, npages, list_size); - ret = pin_user_pages(cur_base, pinned, -gup_flags, page_list, NULL); - if (ret != pinned) - goto out; - - if (!last_pfn) - map_pfn = page_to_pfn(page_list[0]); - - for (i = 0; i < ret; i++) { - unsigned long this_pfn = page_to_pfn(page_list[i]); - u64 csize; - - if (last_pfn && (this_pfn != last_pfn + 1)) { - /* Pin a contiguous chunk of memory */ - csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT; - if (vhost_vdpa_map(v, iova, csize, - map_pfn << PAGE_SHIFT, - msg->perm)) - goto out; - map_pfn = this_pfn; - iova += csize; + iova &= PAGE_MASK; + map_pfn = page_to_pfn(page_list[0]); + + /* One more iteration to avoid extra vdpa_map() call out of loop. */ + for (i = 0; i <= npa
[PATCH] vhost-vdpa: fix page pinning leakage in error path
Pinned pages are not properly accounted particularly when mapping error occurs on IOTLB update. Clean up dangling pinned pages for the error path. As the inflight pinned pages, specifically for memory region that strides across multiple chunks, would need more than one free page for book keeping and accounting. For simplicity, pin pages for all memory in the IOVA range in one go rather than have multiple pin_user_pages calls to make up the entire region. This way it's easier to track and account the pages already mapped, particularly for clean-up in the error path. Fixes: 20453a45fb06 ("vhost: introduce vDPA-based backend") Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 121 +++ 1 file changed, 73 insertions(+), 48 deletions(-) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 796fe97..abc4aa2 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -565,6 +565,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, perm_to_iommu_flags(perm)); } + if (r) + vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1); return r; } @@ -592,21 +594,19 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, struct vhost_dev *dev = &v->vdev; struct vhost_iotlb *iotlb = dev->iotlb; struct page **page_list; - unsigned long list_size = PAGE_SIZE / sizeof(struct page *); + struct vm_area_struct **vmas; unsigned int gup_flags = FOLL_LONGTERM; - unsigned long npages, cur_base, map_pfn, last_pfn = 0; - unsigned long locked, lock_limit, pinned, i; + unsigned long map_pfn, last_pfn = 0; + unsigned long npages, lock_limit; + unsigned long i, nmap = 0; u64 iova = msg->iova; + long pinned; int ret = 0; if (vhost_iotlb_itree_first(iotlb, msg->iova, msg->iova + msg->size - 1)) return -EEXIST; - page_list = (struct page **) __get_free_page(GFP_KERNEL); - if (!page_list) - return -ENOMEM; - if (msg->perm & VHOST_ACCESS_WO) gup_flags |= FOLL_WRITE; @@ -614,61 +614,86 @@ static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v, if (!npages) return -EINVAL; + page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL); + vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *), + GFP_KERNEL); + if (!page_list || !vmas) { + ret = -ENOMEM; + goto free; + } + mmap_read_lock(dev->mm); - locked = atomic64_add_return(npages, &dev->mm->pinned_vm); lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - - if (locked > lock_limit) { + if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) { ret = -ENOMEM; - goto out; + goto unlock; } - cur_base = msg->uaddr & PAGE_MASK; - iova &= PAGE_MASK; + pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags, + page_list, vmas); + if (npages != pinned) { + if (pinned < 0) { + ret = pinned; + } else { + unpin_user_pages(page_list, pinned); + ret = -ENOMEM; + } + goto unlock; + } - while (npages) { - pinned = min_t(unsigned long, npages, list_size); - ret = pin_user_pages(cur_base, pinned, -gup_flags, page_list, NULL); - if (ret != pinned) - goto out; - - if (!last_pfn) - map_pfn = page_to_pfn(page_list[0]); - - for (i = 0; i < ret; i++) { - unsigned long this_pfn = page_to_pfn(page_list[i]); - u64 csize; - - if (last_pfn && (this_pfn != last_pfn + 1)) { - /* Pin a contiguous chunk of memory */ - csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT; - if (vhost_vdpa_map(v, iova, csize, - map_pfn << PAGE_SHIFT, - msg->perm)) - goto out; - map_pfn = this_pfn; - iova += csize; + iova &= PAGE_MASK; + map_pfn = page_to_pfn(page_list[0]); + + /* One more iteration to avoid extra vdpa_map() call out of loop. */ + for (i = 0; i <= npages; i++) { + unsigned long this_pfn;
[PATCH] vdpa/mlx5: should keep avail_index despite device status
A VM with mlx5 vDPA has below warnings while being reset: vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11) vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11) We should allow userspace emulating the virtio device be able to get to vq's avail_index, regardless of vDPA device status. Save the index that was last seen when virtq was stopped, so that userspace doesn't complain. Signed-off-by: Si-Wei Liu --- drivers/vdpa/mlx5/net/mlx5_vnet.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 70676a6..74264e59 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -1133,15 +1133,17 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *m if (!mvq->initialized) return; - if (query_virtqueue(ndev, mvq, &attr)) { - mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n"); - return; - } if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY) return; if (modify_virtqueue(ndev, mvq, MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND)) mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n"); + + if (query_virtqueue(ndev, mvq, &attr)) { + mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n"); + return; + } + mvq->avail_idx = attr.available_index; } static void suspend_vqs(struct mlx5_vdpa_net *ndev) @@ -1411,8 +1413,14 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device *vdev, u16 idx, struct vdpa struct mlx5_virtq_attr attr; int err; - if (!mvq->initialized) - return -EAGAIN; + /* If the virtq object was destroyed, use the value saved at +* the last minute of suspend_vq. This caters for userspace +* that cares about emulating the index after vq is stopped. +*/ + if (!mvq->initialized) { + state->avail_index = mvq->avail_idx; + return 0; + } err = query_virtqueue(ndev, mvq, &attr); if (err) { -- 1.8.3.1
Re: [PATCH v4 00/11] Hyper-V: Support PAGE_SIZE larger than 4K
On Wed, Sep 16, 2020 at 11:48:06AM +0800, Boqun Feng wrote: > This patchset add the necessary changes to support guests whose page > size is larger than 4K. And the main architecture which we develop this > for is ARM64 (also it's the architecture that I use to test this > feature). > > Previous version: > v1: https://lore.kernel.org/lkml/20200721014135.84140-1-boqun.f...@gmail.com/ > v2: https://lore.kernel.org/lkml/20200902030107.33380-1-boqun.f...@gmail.com > v3: https://lore.kernel.org/lkml/20200910143455.109293-1-boqun.f...@gmail.com/ > > Changes since v3: > > * Fix a bug that ringbuffer sizes are not page-aligned when > PAGE_SIZE = 16k. Drop the Acked-by and Reviewed-by tags for > those patches accordingly. > > * Code improvement as per suggestion from Michael Kelley. > > I've done some tests with PAGE_SIZE=64k and PAGE_SIZE=16k configurations > on ARM64 guests (with Michael's patchset[1] for ARM64 Hyper-V guest > support), everything worked fine ;-) > > Looking forwards to comments and suggestions! > > Regards, > Boqun > > [1]: > https://lore.kernel.org/lkml/1598287583-71762-1-git-send-email-mikel...@microsoft.com/ > > Boqun Feng (11): > Drivers: hv: vmbus: Always use HV_HYP_PAGE_SIZE for gpadl > Drivers: hv: vmbus: Move __vmbus_open() > Drivers: hv: vmbus: Introduce types of GPADL > Drivers: hv: Use HV_HYP_PAGE in hv_synic_enable_regs() > Drivers: hv: vmbus: Move virt_to_hvpfn() to hyperv header > hv: hyperv.h: Introduce some hvpfn helper functions > hv_netvsc: Use HV_HYP_PAGE_SIZE for Hyper-V communication > Input: hyperv-keyboard: Use VMBUS_RING_SIZE() for ringbuffer sizes > HID: hyperv: Use VMBUS_RING_SIZE() for ringbuffer sizes > Driver: hv: util: Use VMBUS_RING_SIZE() for ringbuffer sizes > scsi: storvsc: Support PAGE_SIZE larger than 4K Series applied to hyperv-next. I also replaced the tabs with spaces in the commit messages of patch 8 through patch 10. Wei.
Re: [PATCH v3 08/11] Input: hyperv-keyboard: Make ringbuffer at least take two pages
On Mon, Sep 14, 2020 at 04:46:00PM +0800, Boqun Feng wrote: > On Thu, Sep 10, 2020 at 10:34:52PM +0800, Boqun Feng wrote: > > When PAGE_SIZE > HV_HYP_PAGE_SIZE, we need the ringbuffer size to be at > > least 2 * PAGE_SIZE: one page for the header and at least one page of > > the data part (because of the alignment requirement for double mapping). > > > > So make sure the ringbuffer sizes to be at least 2 * PAGE_SIZE when > > using vmbus_open() to establish the vmbus connection. > > > > Signed-off-by: Boqun Feng > > --- > > drivers/input/serio/hyperv-keyboard.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/input/serio/hyperv-keyboard.c > > b/drivers/input/serio/hyperv-keyboard.c > > index df4e9f6f4529..6ebc61e2db3f 100644 > > --- a/drivers/input/serio/hyperv-keyboard.c > > +++ b/drivers/input/serio/hyperv-keyboard.c > > @@ -75,8 +75,8 @@ struct synth_kbd_keystroke { > > > > #define HK_MAXIMUM_MESSAGE_SIZE 256 > > > > -#define KBD_VSC_SEND_RING_BUFFER_SIZE (40 * 1024) > > -#define KBD_VSC_RECV_RING_BUFFER_SIZE (40 * 1024) > > +#define KBD_VSC_SEND_RING_BUFFER_SIZE max(40 * 1024, (int)(2 * > > PAGE_SIZE)) > > +#define KBD_VSC_RECV_RING_BUFFER_SIZE max(40 * 1024, (int)(2 * > > PAGE_SIZE)) > > > > Hmm.. just realized there is a problem here, if PAGE_SIZE = 16k, then > 40 * 1024 > 2 * PAGE_SIZE, however in the ring buffer size should also > be page aligned, otherwise vmbus_open() will fail. > > I plan to modify this as > > in linux/hyperv.h: > > #define VMBUS_RING_SIZE(payload_sz) PAGE_ALIGN(sizeof(struct hv_ring_buffer) > + (playload_sz)) > > and here: > > #define KBD_VSC_SEND_RING_BUFFER_SIZE VMBUS_RING_SIZE(36 * 1024) > #define KBD_VSC_RECV_RING_BUFFER_SIZE VMBUS_RING_SIZE(36 * 1024) > > and the similar change for patch #9. OOI why do you reduce the size by 4k here? Wei. > > Thoughts? > > Regards, > Boqun > > > #define XTKBD_EMUL0 0xe0 > > #define XTKBD_EMUL1 0xe1 > > -- > > 2.28.0 > >
Re: [RFC 02/11] Drivers: hv: vmbus: Move __vmbus_open()
On Tue, Jul 21, 2020 at 09:41:26AM +0800, Boqun Feng wrote: > Pure function movement, no functional changes. The move is made, because > in a later change, __vmbus_open() will rely on some static functions > afterwards, so we sperate the move and the modification of > __vmbus_open() in two patches to make it easy to review. > > Signed-off-by: Boqun Feng Reviewed-by: Wei Liu
Re: [RFC 01/11] Drivers: hv: vmbus: Always use HV_HYP_PAGE_SIZE for gpadl
On Tue, Jul 21, 2020 at 09:41:25AM +0800, Boqun Feng wrote: > Since the hypervisor always uses 4K as its page size, the size of PFNs > used for gpadl should be HV_HYP_PAGE_SIZE rather than PAGE_SIZE, so > adjust this accordingly as the preparation for supporting 16K/64K page > size guests. It may be worth calling out there is no change on x86 because HV_HYP_PAGE_SHIFT and PAGE_SHIFT are of the same value there. Wei.
Re: [PATCH net-next] net: hyperv: Add attributes to show RX/TX indirection table
Hi Chi Thanks for your patch. A few things need to be fixed before it can be accepted upstream. On Fri, Jul 17, 2020 at 06:04:31AM +, Chi Song wrote: > The network is observed with low performance, if TX indirection table > is imbalance. But the table is in memory and set in runtime, it's > hard to know. Add them to attributes can help on troubleshooting. Missing Signed-off-by here. I assume you wrote this patch so please add Signed-off-by: Chi Song If there are other authors, please add their SoBs too. Please wrap the commit message to around 72 to 80 columns wide. I notice you only talked about TX table but in the code your also added support for RX table. I would also suggest changing the commit message a bit to: An imbalanced TX indirection table causes netvsc to have low performance. This table is created and managed during runtime. To help better diagnose performance issues caused by imbalanced tables, add device attributes to show the content of TX and RX indirection tables. Perhaps RX table causes low performance as well? If so, the above message needs further adjustment to account for that too. I will leave reviewing the code to Haiyang and Stephen. Wei. > --- > drivers/net/hyperv/netvsc_drv.c | 46 + > 1 file changed, 46 insertions(+) > > diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c > index 6267f706e8ee..cd6fe96e10c1 100644 > --- a/drivers/net/hyperv/netvsc_drv.c > +++ b/drivers/net/hyperv/netvsc_drv.c > @@ -2370,6 +2370,51 @@ static int netvsc_unregister_vf(struct net_device > *vf_netdev) > return NOTIFY_OK; > } > > +static ssize_t tx_indirection_table_show(struct device *dev, > + struct device_attribute *dev_attr, > + char *buf) > +{ > + struct net_device *ndev = to_net_dev(dev); > + struct net_device_context *ndc = netdev_priv(ndev); > + int i = 0; > + ssize_t offset = 0; > + > + for (i = 0; i < VRSS_SEND_TAB_SIZE; i++) > + offset += sprintf(buf + offset, "%u ", ndc->tx_table[i]); > + buf[offset - 1] = '\n'; > + > + return offset; > +} > +static DEVICE_ATTR_RO(tx_indirection_table); > + > +static ssize_t rx_indirection_table_show(struct device *dev, > + struct device_attribute *dev_attr, > + char *buf) > +{ > + struct net_device *ndev = to_net_dev(dev); > + struct net_device_context *ndc = netdev_priv(ndev); > + int i = 0; > + ssize_t offset = 0; > + > + for (i = 0; i < ITAB_NUM; i++) > + offset += sprintf(buf + offset, "%u ", ndc->rx_table[i]); > + buf[offset - 1] = '\n'; > + > + return offset; > +} > +static DEVICE_ATTR_RO(rx_indirection_table); > + > +static struct attribute *netvsc_dev_attrs[] = { > + &dev_attr_tx_indirection_table.attr, > + &dev_attr_rx_indirection_table.attr, > + NULL > +}; > + > +const struct attribute_group netvsc_dev_group = { > + .name = NULL, > + .attrs = netvsc_dev_attrs, > +}; > + > static int netvsc_probe(struct hv_device *dev, > const struct hv_vmbus_device_id *dev_id) > { > @@ -2410,6 +2455,7 @@ static int netvsc_probe(struct hv_device *dev, > > net->netdev_ops = &device_ops; > net->ethtool_ops = ðtool_ops; > + net->sysfs_groups[0] = &netvsc_dev_group; > SET_NETDEV_DEV(net, &dev->device); > > /* We always need headroom for rndis header */ > -- > 2.25.1
Re: [PATCH v2 3/3] hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening
On Mon, Jun 29, 2020 at 09:33:15PM +, Haiyang Zhang wrote: > > > > -Original Message- > > From: Andres Beltran > > Sent: Monday, June 29, 2020 4:02 PM > > To: KY Srinivasan ; Haiyang Zhang > > ; Stephen Hemminger ; > > wei@kernel.org > > Cc: linux-hyp...@vger.kernel.org; linux-ker...@vger.kernel.org; Michael > > Kelley ; parri.and...@gmail.com; Andres Beltran > > ; David S . Miller ; Jakub > > Kicinski ; netdev@vger.kernel.org > > Subject: [PATCH v2 3/3] hv_netvsc: Use vmbus_requestor to generate > > transaction IDs for VMBus hardening > > > > Currently, pointers to guest memory are passed to Hyper-V as > > transaction IDs in netvsc. In the face of errors or malicious > > behavior in Hyper-V, netvsc should not expose or trust the transaction > > IDs returned by Hyper-V to be valid guest memory addresses. Instead, > > use small integers generated by vmbus_requestor as requests > > (transaction) IDs. > > > > Cc: David S. Miller > > Cc: Jakub Kicinski > > Cc: netdev@vger.kernel.org > > Signed-off-by: Andres Beltran > > Reviewed-by: Haiyang Zhang > > Thanks Haiyang for reviewing. David and Jakub: This patch depends on the first patch. I intend to pick up this patch via hyperv.git. This makes life easier for all of us. Let me know if you disagree. Wei.
Re: [PATCH 0/3] Drivers: hv: vmbus: vmbus_requestor data structure
On Fri, Jun 26, 2020 at 04:48:17PM +0200, Andrea Parri wrote: > On Fri, Jun 26, 2020 at 01:42:27PM +0000, Wei Liu wrote: > > On Thu, Jun 25, 2020 at 11:37:20AM -0400, Andres Beltran wrote: > > > From: Andres Beltran (Microsoft) > > > > > > Currently, VMbus drivers use pointers into guest memory as request IDs > > > for interactions with Hyper-V. To be more robust in the face of errors > > > or malicious behavior from a compromised Hyper-V, avoid exposing > > > guest memory addresses to Hyper-V. Also avoid Hyper-V giving back a > > > bad request ID that is then treated as the address of a guest data > > > structure with no validation. Instead, encapsulate these memory > > > addresses and provide small integers as request IDs. > > > > > > The first patch creates the definitions for the data structure, provides > > > helper methods to generate new IDs and retrieve data, and > > > allocates/frees the memory needed for vmbus_requestor. > > > > > > The second and third patches make use of vmbus_requestor to send request > > > IDs to Hyper-V in storvsc and netvsc respectively. > > > > > > > Per my understanding, this new data structure is per-channel, so it > > won't introduce contention on the lock in multi-queue scenario. Have you > > done any testing to confirm there is no severe performance regression? > > I did run some performance tests using our dev pipeline (storage and > network workloads). I did not find regressions w.r.t. baseline. Thanks, that's good to hear. Wei. > > Andrea
Re: [PATCH 0/3] Drivers: hv: vmbus: vmbus_requestor data structure
On Thu, Jun 25, 2020 at 11:37:20AM -0400, Andres Beltran wrote: > From: Andres Beltran (Microsoft) > > Currently, VMbus drivers use pointers into guest memory as request IDs > for interactions with Hyper-V. To be more robust in the face of errors > or malicious behavior from a compromised Hyper-V, avoid exposing > guest memory addresses to Hyper-V. Also avoid Hyper-V giving back a > bad request ID that is then treated as the address of a guest data > structure with no validation. Instead, encapsulate these memory > addresses and provide small integers as request IDs. > > The first patch creates the definitions for the data structure, provides > helper methods to generate new IDs and retrieve data, and > allocates/frees the memory needed for vmbus_requestor. > > The second and third patches make use of vmbus_requestor to send request > IDs to Hyper-V in storvsc and netvsc respectively. > Per my understanding, this new data structure is per-channel, so it won't introduce contention on the lock in multi-queue scenario. Have you done any testing to confirm there is no severe performance regression? Wei. > Thanks. > Andres Beltran > > Cc: linux-s...@vger.kernel.org > Cc: netdev@vger.kernel.org > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Cc: "David S. Miller" > Cc: Jakub Kicinski > > Andres Beltran (3): > Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus > hardening > scsi: storvsc: Use vmbus_requestor to generate transaction ids for > VMBus hardening > hv_netvsc: Use vmbus_requestor to generate transaction ids for VMBus > hardening > > drivers/hv/channel.c | 150 ++ > drivers/net/hyperv/hyperv_net.h | 10 ++ > drivers/net/hyperv/netvsc.c | 56 +-- > drivers/net/hyperv/rndis_filter.c | 1 + > drivers/scsi/storvsc_drv.c| 62 ++-- > include/linux/hyperv.h| 22 + > 6 files changed, 283 insertions(+), 18 deletions(-) > > -- > 2.25.1 >
Re: [PATCH v2] hv_netvsc: Fix netvsc_start_xmit's return type
David Do you want this to go through net tree? I can submit it via hyperv tree if that's preferred. Wei. On Tue, Apr 28, 2020 at 10:54:56AM -0700, Nathan Chancellor wrote: > netvsc_start_xmit is used as a callback function for the ndo_start_xmit > function pointer. ndo_start_xmit's return type is netdev_tx_t but > netvsc_start_xmit's return type is int. > > This causes a failure with Control Flow Integrity (CFI), which requires > function pointer prototypes and callback function definitions to match > exactly. When CFI is in enforcing, the kernel panics. When booting a > CFI kernel with WSL 2, the VM is immediately terminated because of this. > > The splat when CONFIG_CFI_PERMISSIVE is used: > > [5.916765] CFI failure (target: netvsc_start_xmit+0x0/0x10): > [5.916771] WARNING: CPU: 8 PID: 0 at kernel/cfi.c:29 > __cfi_check_fail+0x2e/0x40 > [5.916772] Modules linked in: > [5.916774] CPU: 8 PID: 0 Comm: swapper/8 Not tainted > 5.7.0-rc3-next-20200424-microsoft-cbl-1-ged4eb37d2c69-dirty #1 > [5.916776] RIP: 0010:__cfi_check_fail+0x2e/0x40 > [5.916777] Code: 48 c7 c7 70 98 63 a9 48 c7 c6 11 db 47 a9 e8 69 55 59 00 > 85 c0 75 02 5b c3 48 c7 c7 73 c6 43 a9 48 89 de 31 c0 e8 12 2d f0 ff <0f> 0b > 5b c3 00 00 cc cc 00 00 cc cc 00 00 cc cc 00 00 85 f6 74 25 > [5.916778] RSP: 0018:a803c0260b78 EFLAGS: 00010246 > [5.916779] RAX: 712a1af25779e900 RBX: a8cf7950 RCX: > a962cf08 > [5.916779] RDX: a9c36b60 RSI: 0082 RDI: > a9c36b5c > [5.916780] RBP: 8ffc4779c2c0 R08: 0001 R09: > a9c3c300 > [5.916781] R10: 0151 R11: a9c36b60 R12: > 8ffe39084000 > [5.916782] R13: a8cf7950 R14: a8d12cb0 R15: > 8ffe39320140 > [5.916784] FS: () GS:8ffe3bc0() > knlGS: > [5.916785] CS: 0010 DS: ES: CR0: 80050033 > [5.916786] CR2: 7ffef5749408 CR3: 0002f4f5e000 CR4: > 00340ea0 > [5.916787] Call Trace: > [5.916788] > [5.916790] __cfi_check+0x3ab58/0x450e0 > [5.916793] ? dev_hard_start_xmit+0x11f/0x160 > [5.916795] ? sch_direct_xmit+0xf2/0x230 > [5.916796] ? __dev_queue_xmit.llvm.11471227737707190958+0x69d/0x8e0 > [5.916797] ? neigh_resolve_output+0xdf/0x220 > [5.916799] ? neigh_connected_output.cfi_jt+0x8/0x8 > [5.916801] ? ip6_finish_output2+0x398/0x4c0 > [5.916803] ? nf_nat_ipv6_out+0x10/0xa0 > [5.916804] ? nf_hook_slow+0x84/0x100 > [5.916807] ? ip6_input_finish+0x8/0x8 > [5.916807] ? ip6_output+0x6f/0x110 > [5.916808] ? __ip6_local_out.cfi_jt+0x8/0x8 > [5.916810] ? mld_sendpack+0x28e/0x330 > [5.916811] ? ip_rt_bug+0x8/0x8 > [5.916813] ? mld_ifc_timer_expire+0x2db/0x400 > [5.916814] ? neigh_proxy_process+0x8/0x8 > [5.916816] ? call_timer_fn+0x3d/0xd0 > [5.916817] ? __run_timers+0x2a9/0x300 > [5.916819] ? rcu_core_si+0x8/0x8 > [5.916820] ? run_timer_softirq+0x14/0x30 > [5.916821] ? __do_softirq+0x154/0x262 > [5.916822] ? native_x2apic_icr_write+0x8/0x8 > [5.916824] ? irq_exit+0xba/0xc0 > [5.916825] ? hv_stimer0_vector_handler+0x99/0xe0 > [5.916826] ? hv_stimer0_callback_vector+0xf/0x20 > [5.916826] > [5.916828] ? hv_stimer_global_cleanup.cfi_jt+0x8/0x8 > [5.916829] ? raw_setsockopt+0x8/0x8 > [5.916830] ? default_idle+0xe/0x10 > [5.916832] ? do_idle.llvm.10446269078108580492+0xb7/0x130 > [5.916833] ? raw_setsockopt+0x8/0x8 > [5.916833] ? cpu_startup_entry+0x15/0x20 > [5.916835] ? cpu_hotplug_enable.cfi_jt+0x8/0x8 > [5.916836] ? start_secondary+0x188/0x190 > [5.916837] ? secondary_startup_64+0xa5/0xb0 > [5.916838] ---[ end trace f2683fa869597ba5 ]--- > > Avoid this by using the right return type for netvsc_start_xmit. > > Fixes: fceaf24a943d8 ("Staging: hv: add the Hyper-V virtual network driver") > Link: https://github.com/ClangBuiltLinux/linux/issues/1009 > Signed-off-by: Nathan Chancellor > --- > > v1 -> v2: > > * Move splat into commit message rather than issue. > > Comment from previous version: > > Do note that netvsc_xmit still returns int because netvsc_xmit has a > potential return from netvsc_vf_xmit, which does not return netdev_tx_t > because of the call to dev_queue_xmit. > > I am not sure if that is an oversight that was introduced by > commit 0c195567a8f6e ("netvsc: transparent VF management") or if > everything works properly as it is now. > > My patch is purely concerned with making the definition match the > prototype so it should be NFC aside from avoiding the CFI panic. > > drivers/net/hyperv/netvsc_drv.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c > index d8e86bdbfba1e..ebcfbae056900 100644 > --- a/drivers/net/hyperv/netvsc_drv.c > +++ b/drivers/net/hyperv/ne
Re: [PATCH] hv_netvsc: Fix netvsc_start_xmit's return type
On Mon, Apr 27, 2020 at 08:30:43PM -0700, Nathan Chancellor wrote: > netvsc_start_xmit is used as a callback function for the ndo_start_xmit > function pointer. ndo_start_xmit's return type is netdev_tx_t but > netvsc_start_xmit's return type is int. > > This causes a failure with Control Flow Integrity (CFI), which requires > function pointer prototypes and callback function definitions to match > exactly. When CFI is in enforcing, the kernel panics. When booting a > CFI kernel with WSL 2, the VM is immediately terminated because of this: > > $ wsl.exe -d ubuntu > The Windows Subsystem for Linux instance has terminated. > > Avoid this by using the right return type for netvsc_start_xmit. > > Fixes: fceaf24a943d8 ("Staging: hv: add the Hyper-V virtual network driver") > Link: https://github.com/ClangBuiltLinux/linux/issues/1009 Please consider pulling in the panic log from #1009 to the commit message. It is much better than the one line message above. > Signed-off-by: Nathan Chancellor > --- > > Do note that netvsc_xmit still returns int because netvsc_xmit has a > potential return from netvsc_vf_xmit, which does not return netdev_tx_t > because of the call to dev_queue_xmit. > > I am not sure if that is an oversight that was introduced by > commit 0c195567a8f6e ("netvsc: transparent VF management") or if > everything works properly as it is now. > > My patch is purely concerned with making the definition match the > prototype so it should be NFC aside from avoiding the CFI panic. > > drivers/net/hyperv/netvsc_drv.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c > index d8e86bdbfba1e..ebcfbae056900 100644 > --- a/drivers/net/hyperv/netvsc_drv.c > +++ b/drivers/net/hyperv/netvsc_drv.c > @@ -707,7 +707,8 @@ static int netvsc_xmit(struct sk_buff *skb, struct > net_device *net, bool xdp_tx) > goto drop; > } > > -static int netvsc_start_xmit(struct sk_buff *skb, struct net_device *ndev) > +static netdev_tx_t netvsc_start_xmit(struct sk_buff *skb, > + struct net_device *ndev) > { > return netvsc_xmit(skb, ndev, false); > } > > base-commit: 51184ae37e0518fd90cb437a2fbc953ae558cd0d > -- > 2.26.2 >
Re: [PATCH net-next] MAINTAINERS: xen-netback: update my email address
On Fri, 13 Sep 2019 at 16:28, Wei Liu wrote: > > On Fri, 13 Sep 2019 at 13:47, Paul Durrant wrote: > > > > My Citrix email address will expire shortly. > > > > Signed-off-by: Paul Durrant > > Acked-by: Wei Liu Or rather: Acked-by: Wei Liu
Re: [PATCH net-next] MAINTAINERS: xen-netback: update my email address
On Fri, 13 Sep 2019 at 13:47, Paul Durrant wrote: > > My Citrix email address will expire shortly. > > Signed-off-by: Paul Durrant Acked-by: Wei Liu
Re: [PATCH] xen-netback: no need to check return value of debugfs_create functions
On Sat, Aug 10, 2019 at 12:31:08PM +0200, Greg Kroah-Hartman wrote: > When calling debugfs functions, there is no need to ever check the > return value. The function can work or not, but the code logic should > never do something different based on this. > > Cc: Wei Liu > Cc: Paul Durrant > Cc: xen-de...@lists.xenproject.org > Cc: netdev@vger.kernel.org > Signed-off-by: Greg Kroah-Hartman Acked-by: Wei Liu
[PATCH net-next] Update my email address
Signed-off-by: Wei Liu --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 0c55b0fedbe2..e212c6a42ddf 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -17295,7 +17295,7 @@ F: Documentation/ABI/stable/sysfs-hypervisor-xen F: Documentation/ABI/testing/sysfs-hypervisor-xen XEN NETWORK BACKEND DRIVER -M: Wei Liu +M: Wei Liu M: Paul Durrant L: xen-de...@lists.xenproject.org (moderated for non-subscribers) L: netdev@vger.kernel.org -- 2.20.1
Re: [PATCH net v8] failover: allow name change on IFF_UP slave interfaces
On 4/9/2019 9:13 AM, Michael S. Tsirkin wrote: On Mon, Apr 08, 2019 at 07:45:27PM -0400, Si-Wei Liu wrote: When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. It's less risky to lift up the rename restriction on failover slave which is already UP. Although it's possible this change may potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to listen for the rename events on failover slaves. Userspace component interacting with slaves is expected to be changed to operate on failover master interface instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and userspace components should only deal with failover master in the long run. Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") Signed-off-by: Si-Wei Liu Reviewed-by: Liran Alon OK let's start with that. We can always add more events. Yep. Nothing is impossible but this is the least we should do. We can add more events based on *real use case*. -Siwei -- v1 -> v2: - Drop configurable module parameter (Sridhar) v2 -> v3: - Drop additional IFF_SLAVE_RENAME_OK flag (Sridhar) - Send down and up events around rename (Michael S. Tsirkin) v3 -> v4: - Simplify notification to be sent (Stephen Hemminger) v4 -> v5: - Sync up code with latest net-next (Sridhar) - Use proper structure initialization (Stephen, Jiri) v5 -> v6: - Make the property of live name change a generic flag (Stephen) v6 -> v7: - Remove NETDEV_CHANGE notification that is deemed unnecessary (Stephen) v7 -> v8: - Fix commit message that references link up/down events still (Michael) --- include/linux/netdevice.h | 3 +++ net/core/dev.c| 16 +++- net/core/failover.c | 6 +++--- 3 files changed, 21 insertions(+), 4 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 78f5ec4e..ea9a63f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1498,6 +1498,7 @@ struct net_device_ops { * @IFF_FAILOVER: device is a failover master device * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device + * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1530,6 +1531,7 @@ enum netdev_priv_flags { IFF_FAILOVER= 1<<27, IFF_FAILOVER_SLAVE = 1<<28, IFF_L3MDEV_RX_HANDLER = 1<<29, + IFF_LIVE_RENAME_OK = 1<<30, }; #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN @@ -1561,6 +1563,7 @@ enum netdev_priv_flags { #define IFF_FAILOVER IFF_FAILOVER #define IFF_FAILOVER_SLAVEIFF_FAILOVER_SLAVE #d
Re: [PATCH net v2] failover: allow name change on IFF_UP slave interfaces
On 3/6/2019 8:13 PM, Samudrala, Sridhar wrote: On 3/6/2019 7:08 PM, Si-Wei Liu wrote: When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. It's less risky to lift up the rename restriction on failover slave which is already UP. Although it's possible this change may potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to work with the new naming behavior of failover slaves. Userspace component interacting with slaves should be changed to operate on failover master instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and all userspace should only deal with master in the long run. Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") Signed-off-by: Si-Wei Liu Reviewed-by: Liran Alon Acked-by: Michael S. Tsirkin --- v1 -> v2: - Drop configurable module parameter (Sridhar) include/linux/netdevice.h | 3 +++ net/core/dev.c| 3 ++- net/core/failover.c | 6 +++--- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 857f8ab..6d9e4e0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1487,6 +1487,7 @@ struct net_device_ops { * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook * @IFF_FAILOVER: device is a failover master device * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device + * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running */ enum netdev_priv_flags { IFF_802_1Q_VLAN= 1<<0, @@ -1518,6 +1519,7 @@ enum netdev_priv_flags { IFF_NO_RX_HANDLER= 1<<26, IFF_FAILOVER= 1<<27, IFF_FAILOVER_SLAVE= 1<<28, +IFF_SLAVE_RENAME_OK= 1<<29, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1548,6 +1550,7 @@ enum netdev_priv_flags { #define IFF_NO_RX_HANDLERIFF_NO_RX_HANDLER #define IFF_FAILOVERIFF_FAILOVER #define IFF_FAILOVER_SLAVEIFF_FAILOVER_SLAVE +#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK /** *struct net_device - The DEVICE structure. diff --git a/net/core/dev.c b/net/core/dev.c index 722d50d..ae070de 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char *newname) BUG_ON(!dev_net(dev)); net = dev_net(dev); -if (dev->flags & IFF_UP) +if (dev->flags & IFF_UP && +!(dev->priv_flags & IFF_SLAVE_RENAME_OK)) return -EBUSY; Without the configurable module parameter, i think we don't even need the new SLAVE_RENAME_OK private flag. Can't we simply check for IFF_FAILOVER_SLAVE ? I'd prefer keeping this flag for now, even though without configurab
[PATCH net v2] failover: allow name change on IFF_UP slave interfaces
When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. It's less risky to lift up the rename restriction on failover slave which is already UP. Although it's possible this change may potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to work with the new naming behavior of failover slaves. Userspace component interacting with slaves should be changed to operate on failover master instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and all userspace should only deal with master in the long run. Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") Signed-off-by: Si-Wei Liu Reviewed-by: Liran Alon Acked-by: Michael S. Tsirkin --- v1 -> v2: - Drop configurable module parameter (Sridhar) include/linux/netdevice.h | 3 +++ net/core/dev.c| 3 ++- net/core/failover.c | 6 +++--- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 857f8ab..6d9e4e0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1487,6 +1487,7 @@ struct net_device_ops { * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook * @IFF_FAILOVER: device is a failover master device * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device + * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1518,6 +1519,7 @@ enum netdev_priv_flags { IFF_NO_RX_HANDLER = 1<<26, IFF_FAILOVER= 1<<27, IFF_FAILOVER_SLAVE = 1<<28, + IFF_SLAVE_RENAME_OK = 1<<29, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1548,6 +1550,7 @@ enum netdev_priv_flags { #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE +#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK /** * struct net_device - The DEVICE structure. diff --git a/net/core/dev.c b/net/core/dev.c index 722d50d..ae070de 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char *newname) BUG_ON(!dev_net(dev)); net = dev_net(dev); - if (dev->flags & IFF_UP) + if (dev->flags & IFF_UP && + !(dev->priv_flags & IFF_SLAVE_RENAME_OK)) return -EBUSY; write_seqcount_begin(&devnet_rename_seq); diff --git a/net/core/failover.c b/net/core/failover.c index 4a92a98..34c5c87 100644 --- a/net/core/failover.c +++ b/net/core/failover.c @@ -80,14 +80,14 @@ static int failover_slave_register(struct
Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
On 3/5/2019 11:23 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 11:15:06PM -0800, si-wei liu wrote: On 3/5/2019 10:43 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 04:51:00PM -0800, si-wei liu wrote: On 3/5/2019 4:36 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 04:20:50PM -0800, si-wei liu wrote: On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote: On 3/5/2019 11:24 AM, Stephen Hemminger wrote: On Tue, 5 Mar 2019 11:19:32 -0800 si-wei liu wrote: I have a vague idea: would it work to *not* set IFF_UP on slave devices at all? Hmm, I ever thought about this option, and it appears this solution is more invasive than required to convert existing scripts, despite the controversy of introducing internal netdev state to differentiate user visible state. Either we disallow slave to be brought up by user, or to not set IFF_UP flag but instead use the internal one, could end up with substantial behavioral change that breaks scripts. Consider any admin script that does `ip link set dev ... up' successfully just assumes the link is up and subsequent operation can be done as usual. How would it work when carrier is off? While it *may* work for dracut (yet to be verified), I'm a bit concerned that there are more scripts to be converted than those that don't follow volatile failover slave names. It's technically doable, but may not worth the effort (in terms of porting existing scripts/apps). Thanks -Siwei Won't work for most devices. Many devices turn off PHY and link layer if not IFF_UP True, that's what I said about introducing internal state for those driver and other kernel component. Very invasive change indeed. -Siwei Well I did say it's vague. How about hiding IFF_UP from dev_get_flags (and probably __dev_change_flags)? Any different? This has small footprint for the kernel change for sure, while the discrepancy is still there. Anyone who writes code for IFF_UP will not notice IFF_FAILOVER_SLAVE. Not to mention more userspace "fixup" work has to be done due to this change. -Siwei Point is it's ok since most userspace should just ignore slaves - hopefully it will just ignore it since it already ignores interfaces that are down. Admin script thought the interface could be bright up and do further operations without checking the UP flag. These scripts then would be broken on any box with multiple interfaces since not all of these would have carrier. Consider a script executing `ifconfig ... up' and once succeeds runs tcpdump or some other command relying on UP interface. It's quite common that those scripts don't check the UP flag but instead just rely on the well-known fact that the command exits with 0 meaning the interface should be UP. This change might well break scripts of that kind. I am sorry I don't get it. Could you give an example of a script that works now but would be broken? https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/netdevice.sh#L27 https://github.com/WPO-Foundation/wptagent/blob/master/internal/adb.py#L443 https://github.com/openstack/steth/blob/master/steth/agent/api.py#L134 There are more if you keep searching. -Siwei It doesn't look to be a reliable way of prohibit userspace from operating against slaves. -Siwei This does not mean we shouldn't make an effort to disable broken configurations. I am not arguing against your patch. Not at all. I see better hiding of slaves as a separate enhancement. I understand, but my point is we should try to minimize unnecessary side impact to the current usage for whatever "hiding" effort we can make. It's hard to find a tradeoff sometimes. Yes if some userspace made an assumption and it worked, we should keep it working I think. I don't necessarily agree we should worry too much about theoretical issues. In half a year since the feature got merged it's unlikely there are millions of slightly different scripts using it. Acked-by: Michael S. Tsirkin Thank you. -Siwei
Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
On 3/5/2019 10:43 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 04:51:00PM -0800, si-wei liu wrote: On 3/5/2019 4:36 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 04:20:50PM -0800, si-wei liu wrote: On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote: On 3/5/2019 11:24 AM, Stephen Hemminger wrote: On Tue, 5 Mar 2019 11:19:32 -0800 si-wei liu wrote: I have a vague idea: would it work to *not* set IFF_UP on slave devices at all? Hmm, I ever thought about this option, and it appears this solution is more invasive than required to convert existing scripts, despite the controversy of introducing internal netdev state to differentiate user visible state. Either we disallow slave to be brought up by user, or to not set IFF_UP flag but instead use the internal one, could end up with substantial behavioral change that breaks scripts. Consider any admin script that does `ip link set dev ... up' successfully just assumes the link is up and subsequent operation can be done as usual. How would it work when carrier is off? While it *may* work for dracut (yet to be verified), I'm a bit concerned that there are more scripts to be converted than those that don't follow volatile failover slave names. It's technically doable, but may not worth the effort (in terms of porting existing scripts/apps). Thanks -Siwei Won't work for most devices. Many devices turn off PHY and link layer if not IFF_UP True, that's what I said about introducing internal state for those driver and other kernel component. Very invasive change indeed. -Siwei Well I did say it's vague. How about hiding IFF_UP from dev_get_flags (and probably __dev_change_flags)? Any different? This has small footprint for the kernel change for sure, while the discrepancy is still there. Anyone who writes code for IFF_UP will not notice IFF_FAILOVER_SLAVE. Not to mention more userspace "fixup" work has to be done due to this change. -Siwei Point is it's ok since most userspace should just ignore slaves - hopefully it will just ignore it since it already ignores interfaces that are down. Admin script thought the interface could be bright up and do further operations without checking the UP flag. These scripts then would be broken on any box with multiple interfaces since not all of these would have carrier. Consider a script executing `ifconfig ... up' and once succeeds runs tcpdump or some other command relying on UP interface. It's quite common that those scripts don't check the UP flag but instead just rely on the well-known fact that the command exits with 0 meaning the interface should be UP. This change might well break scripts of that kind. It doesn't look to be a reliable way of prohibit userspace from operating against slaves. -Siwei This does not mean we shouldn't make an effort to disable broken configurations. I am not arguing against your patch. Not at all. I see better hiding of slaves as a separate enhancement. I understand, but my point is we should try to minimize unnecessary side impact to the current usage for whatever "hiding" effort we can make. It's hard to find a tradeoff sometimes. Acked-by: Michael S. Tsirkin Thank you. -Siwei
Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
On 3/5/2019 4:36 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 04:20:50PM -0800, si-wei liu wrote: On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote: On 3/5/2019 11:24 AM, Stephen Hemminger wrote: On Tue, 5 Mar 2019 11:19:32 -0800 si-wei liu wrote: I have a vague idea: would it work to *not* set IFF_UP on slave devices at all? Hmm, I ever thought about this option, and it appears this solution is more invasive than required to convert existing scripts, despite the controversy of introducing internal netdev state to differentiate user visible state. Either we disallow slave to be brought up by user, or to not set IFF_UP flag but instead use the internal one, could end up with substantial behavioral change that breaks scripts. Consider any admin script that does `ip link set dev ... up' successfully just assumes the link is up and subsequent operation can be done as usual. How would it work when carrier is off? While it *may* work for dracut (yet to be verified), I'm a bit concerned that there are more scripts to be converted than those that don't follow volatile failover slave names. It's technically doable, but may not worth the effort (in terms of porting existing scripts/apps). Thanks -Siwei Won't work for most devices. Many devices turn off PHY and link layer if not IFF_UP True, that's what I said about introducing internal state for those driver and other kernel component. Very invasive change indeed. -Siwei Well I did say it's vague. How about hiding IFF_UP from dev_get_flags (and probably __dev_change_flags)? Any different? This has small footprint for the kernel change for sure, while the discrepancy is still there. Anyone who writes code for IFF_UP will not notice IFF_FAILOVER_SLAVE. Not to mention more userspace "fixup" work has to be done due to this change. -Siwei Point is it's ok since most userspace should just ignore slaves - hopefully it will just ignore it since it already ignores interfaces that are down. Admin script thought the interface could be bright up and do further operations without checking the UP flag. It doesn't look to be a reliable way of prohibit userspace from operating against slaves. -Siwei
Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote: On 3/5/2019 11:24 AM, Stephen Hemminger wrote: On Tue, 5 Mar 2019 11:19:32 -0800 si-wei liu wrote: I have a vague idea: would it work to *not* set IFF_UP on slave devices at all? Hmm, I ever thought about this option, and it appears this solution is more invasive than required to convert existing scripts, despite the controversy of introducing internal netdev state to differentiate user visible state. Either we disallow slave to be brought up by user, or to not set IFF_UP flag but instead use the internal one, could end up with substantial behavioral change that breaks scripts. Consider any admin script that does `ip link set dev ... up' successfully just assumes the link is up and subsequent operation can be done as usual. How would it work when carrier is off? While it *may* work for dracut (yet to be verified), I'm a bit concerned that there are more scripts to be converted than those that don't follow volatile failover slave names. It's technically doable, but may not worth the effort (in terms of porting existing scripts/apps). Thanks -Siwei Won't work for most devices. Many devices turn off PHY and link layer if not IFF_UP True, that's what I said about introducing internal state for those driver and other kernel component. Very invasive change indeed. -Siwei Well I did say it's vague. How about hiding IFF_UP from dev_get_flags (and probably __dev_change_flags)? Any different? This has small footprint for the kernel change for sure, while the discrepancy is still there. Anyone who writes code for IFF_UP will not notice IFF_FAILOVER_SLAVE. Not to mention more userspace "fixup" work has to be done due to this change. -Siwei
Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
On 3/5/2019 12:28 PM, Michael S. Tsirkin wrote: On Tue, Mar 05, 2019 at 11:19:32AM -0800, si-wei liu wrote: On 3/4/2019 6:33 PM, Michael S. Tsirkin wrote: On Mon, Mar 04, 2019 at 07:50:59PM -0500, Si-Wei Liu wrote: When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. For that to work, now introduce a module-level tunable, "slave_rename_ok" that allows users to lift up the rename restriction on failover slave which is already UP. Although it's possible this change potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to work with the new naming behavior of the failover slave. Userspace component interacting with slaves should be changed to operate on failover master instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and all userspace should only deal with master in the long run. The default for the "slave_rename_ok" is set to true(1). If userspace doesn't have the right support in place meanwhile users don't care about reliable userspace naming, the value can be set to false(0). Signed-off-by: si-wei@oracle.com Reviewed-by: Liran Alon Not sure which of the versions I should reply to. Sorry for multiple copies sent. It's fine to reply to this one. I have a vague idea: would it work to *not* set IFF_UP on slave devices at all? Hmm, I ever thought about this option, and it appears this solution is more invasive than required to convert existing scripts, despite the controversy of introducing internal netdev state to differentiate user visible state. Either we disallow slave to be brought up by user, or to not set IFF_UP flag but instead use the internal one, could end up with substantial behavioral change that breaks scripts. Consider any admin script that does `ip link set dev ... up' successfully just assumes the link is up and subsequent operation can be done as usual. While it *may* work for dracut (yet to be verified), I'm a bit concerned that there are more scripts to be converted than those that don't follow volatile failover slave names. It's technically doable, but may not worth the effort (in terms of porting existing scripts/apps). Thanks -Siwei Right. Advantage could be that we prevent all kind of misconfigurations e.g. when one has a route on a slave. The fix for the slave route problem is already there in dracut. The ship has sailed, no matter how seamless upstream thought failover could work with the existing userspace. I would rather avoid introducing more breakage to userspace if there's simple yet less intrusive way to fix the rename issue itself. -Siwei Would this reduce the chances of existing scripts such as dracut being confused? And this leaves open the option for scripts to address slaves by checking some custom attribute. --- include/linux/netdevice.h | 3 +++ net/core/
Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
On 3/5/2019 11:24 AM, Stephen Hemminger wrote: On Tue, 5 Mar 2019 11:19:32 -0800 si-wei liu wrote: I have a vague idea: would it work to *not* set IFF_UP on slave devices at all? Hmm, I ever thought about this option, and it appears this solution is more invasive than required to convert existing scripts, despite the controversy of introducing internal netdev state to differentiate user visible state. Either we disallow slave to be brought up by user, or to not set IFF_UP flag but instead use the internal one, could end up with substantial behavioral change that breaks scripts. Consider any admin script that does `ip link set dev ... up' successfully just assumes the link is up and subsequent operation can be done as usual. While it *may* work for dracut (yet to be verified), I'm a bit concerned that there are more scripts to be converted than those that don't follow volatile failover slave names. It's technically doable, but may not worth the effort (in terms of porting existing scripts/apps). Thanks -Siwei Won't work for most devices. Many devices turn off PHY and link layer if not IFF_UP True, that's what I said about introducing internal state for those driver and other kernel component. Very invasive change indeed. -Siwei
Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
On 3/4/2019 6:33 PM, Michael S. Tsirkin wrote: On Mon, Mar 04, 2019 at 07:50:59PM -0500, Si-Wei Liu wrote: When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. For that to work, now introduce a module-level tunable, "slave_rename_ok" that allows users to lift up the rename restriction on failover slave which is already UP. Although it's possible this change potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to work with the new naming behavior of the failover slave. Userspace component interacting with slaves should be changed to operate on failover master instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and all userspace should only deal with master in the long run. The default for the "slave_rename_ok" is set to true(1). If userspace doesn't have the right support in place meanwhile users don't care about reliable userspace naming, the value can be set to false(0). Signed-off-by: si-wei@oracle.com Reviewed-by: Liran Alon Not sure which of the versions I should reply to. Sorry for multiple copies sent. It's fine to reply to this one. I have a vague idea: would it work to *not* set IFF_UP on slave devices at all? Hmm, I ever thought about this option, and it appears this solution is more invasive than required to convert existing scripts, despite the controversy of introducing internal netdev state to differentiate user visible state. Either we disallow slave to be brought up by user, or to not set IFF_UP flag but instead use the internal one, could end up with substantial behavioral change that breaks scripts. Consider any admin script that does `ip link set dev ... up' successfully just assumes the link is up and subsequent operation can be done as usual. While it *may* work for dracut (yet to be verified), I'm a bit concerned that there are more scripts to be converted than those that don't follow volatile failover slave names. It's technically doable, but may not worth the effort (in terms of porting existing scripts/apps). Thanks -Siwei Would this reduce the chances of existing scripts such as dracut being confused? And this leaves open the option for scripts to address slaves by checking some custom attribute. --- include/linux/netdevice.h | 3 +++ net/core/dev.c| 3 ++- net/core/failover.c | 11 +-- 3 files changed, 14 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 857f8ab..6d9e4e0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1487,6 +1487,7 @@ struct net_device_ops { * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook * @IFF_FAILOVER: device is a failover master device * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
Re: [RFC PATCH net] failover: allow name change on IFF_UP slave interfaces
Sorry for multiple sends. The patch is exactly same. I added a few people who were missing int the cc lines in the first attemt. And corrected the subject line in the second attempt. -Siwei On 3/4/2019 6:04 PM, David Miller wrote: Why did you send this three times? What's different in each of these copies?
Re: [RFC PATCH net] failover: allow name change on IFF_UP slave interfaces
Please disregard patch emails previously sent with similar subject. The patch target is set to net rather than net-next. Any discussion or comment around this patch should go after this email. -Siwei On 3/4/2019 4:53 PM, Si-Wei Liu wrote: When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. For that to work, now introduce a module-level tunable, "slave_rename_ok" that allows users to lift up the rename restriction on failover slave which is already UP. Although it's possible this change potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to work with the new naming behavior of the failover slave. Userspace component interacting with slaves should be changed to operate on failover master instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and all userspace should only deal with master in the long run. The default for the "slave_rename_ok" is set to true(1). If userspace doesn't have the right support in place meanwhile users don't care about reliable userspace naming, the value can be set to false(0). Signed-off-by: si-wei@oracle.com Reviewed-by: Liran Alon --- include/linux/netdevice.h | 3 +++ net/core/dev.c| 3 ++- net/core/failover.c | 11 +-- 3 files changed, 14 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 857f8ab..6d9e4e0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1487,6 +1487,7 @@ struct net_device_ops { * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook * @IFF_FAILOVER: device is a failover master device * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device + * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1518,6 +1519,7 @@ enum netdev_priv_flags { IFF_NO_RX_HANDLER = 1<<26, IFF_FAILOVER= 1<<27, IFF_FAILOVER_SLAVE = 1<<28, + IFF_SLAVE_RENAME_OK = 1<<29, }; #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN @@ -1548,6 +1550,7 @@ enum netdev_priv_flags { #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER #define IFF_FAILOVER_SLAVEIFF_FAILOVER_SLAVE +#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK /** *struct net_device - The DEVICE structure. diff --git a/net/core/dev.c b/net/core/dev.c index 722d50d..ae070de 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char *newname) BUG_ON(!dev_net(dev)); net = dev_net(dev); - if (dev->flags & IFF_UP) +
[RFC PATCH net] failover: allow name change on IFF_UP slave interfaces
When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. For that to work, now introduce a module-level tunable, "slave_rename_ok" that allows users to lift up the rename restriction on failover slave which is already UP. Although it's possible this change potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to work with the new naming behavior of the failover slave. Userspace component interacting with slaves should be changed to operate on failover master instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and all userspace should only deal with master in the long run. The default for the "slave_rename_ok" is set to true(1). If userspace doesn't have the right support in place meanwhile users don't care about reliable userspace naming, the value can be set to false(0). Signed-off-by: si-wei@oracle.com Reviewed-by: Liran Alon --- include/linux/netdevice.h | 3 +++ net/core/dev.c| 3 ++- net/core/failover.c | 11 +-- 3 files changed, 14 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 857f8ab..6d9e4e0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1487,6 +1487,7 @@ struct net_device_ops { * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook * @IFF_FAILOVER: device is a failover master device * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device + * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1518,6 +1519,7 @@ enum netdev_priv_flags { IFF_NO_RX_HANDLER = 1<<26, IFF_FAILOVER= 1<<27, IFF_FAILOVER_SLAVE = 1<<28, + IFF_SLAVE_RENAME_OK = 1<<29, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1548,6 +1550,7 @@ enum netdev_priv_flags { #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE +#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK /** * struct net_device - The DEVICE structure. diff --git a/net/core/dev.c b/net/core/dev.c index 722d50d..ae070de 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char *newname) BUG_ON(!dev_net(dev)); net = dev_net(dev); - if (dev->flags & IFF_UP) + if (dev->flags & IFF_UP && + !(dev->priv_flags & IFF_SLAVE_RENAME_OK)) return -EBUSY; write_seqcount_begin(&devnet_rename_seq); diff --git a/net/core/failover.c b/net/core/failover.c index 4a92a98..1fd8bbb 100644 --- a/net/core/failover.c +++ b/net/core/failover.c @@ -16,6 +16,11 @@ static LIST_HEAD(failover_list); static DEFINE_SPINLOCK(failover_lo
[RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. For that to work, now introduce a module-level tunable, "slave_rename_ok" that allows users to lift up the rename restriction on failover slave which is already UP. Although it's possible this change potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to work with the new naming behavior of the failover slave. Userspace component interacting with slaves should be changed to operate on failover master instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and all userspace should only deal with master in the long run. The default for the "slave_rename_ok" is set to true(1). If userspace doesn't have the right support in place meanwhile users don't care about reliable userspace naming, the value can be set to false(0). Signed-off-by: si-wei@oracle.com Reviewed-by: Liran Alon --- include/linux/netdevice.h | 3 +++ net/core/dev.c| 3 ++- net/core/failover.c | 11 +-- 3 files changed, 14 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 857f8ab..6d9e4e0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1487,6 +1487,7 @@ struct net_device_ops { * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook * @IFF_FAILOVER: device is a failover master device * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device + * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1518,6 +1519,7 @@ enum netdev_priv_flags { IFF_NO_RX_HANDLER = 1<<26, IFF_FAILOVER= 1<<27, IFF_FAILOVER_SLAVE = 1<<28, + IFF_SLAVE_RENAME_OK = 1<<29, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1548,6 +1550,7 @@ enum netdev_priv_flags { #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE +#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK /** * struct net_device - The DEVICE structure. diff --git a/net/core/dev.c b/net/core/dev.c index 722d50d..ae070de 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char *newname) BUG_ON(!dev_net(dev)); net = dev_net(dev); - if (dev->flags & IFF_UP) + if (dev->flags & IFF_UP && + !(dev->priv_flags & IFF_SLAVE_RENAME_OK)) return -EBUSY; write_seqcount_begin(&devnet_rename_seq); diff --git a/net/core/failover.c b/net/core/failover.c index 4a92a98..1fd8bbb 100644 --- a/net/core/failover.c +++ b/net/core/failover.c @@ -16,6 +16,11 @@ static LIST_HEAD(failover_list); static DEFINE_SPINLOCK(failover_lo
[RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces
When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. For that to work, now introduce a module-level tunable, "slave_rename_ok" that allows users to lift up the rename restriction on failover slave which is already UP. Although it's possible this change potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to work with the new naming behavior of the failover slave. Userspace component interacting with slaves should be changed to operate on failover master instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and all userspace should only deal with master in the long run. The default for the "slave_rename_ok" is set to true(1). If userspace doesn't have the right support in place meanwhile users don't care about reliable userspace naming, the value can be set to false(0). Signed-off-by: si-wei@oracle.com Reviewed-by: Liran Alon --- include/linux/netdevice.h | 3 +++ net/core/dev.c| 3 ++- net/core/failover.c | 11 +-- 3 files changed, 14 insertions(+), 3 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 857f8ab..6d9e4e0 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1487,6 +1487,7 @@ struct net_device_ops { * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook * @IFF_FAILOVER: device is a failover master device * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device + * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1518,6 +1519,7 @@ enum netdev_priv_flags { IFF_NO_RX_HANDLER = 1<<26, IFF_FAILOVER= 1<<27, IFF_FAILOVER_SLAVE = 1<<28, + IFF_SLAVE_RENAME_OK = 1<<29, }; #define IFF_802_1Q_VLANIFF_802_1Q_VLAN @@ -1548,6 +1550,7 @@ enum netdev_priv_flags { #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER #define IFF_FAILOVER IFF_FAILOVER #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE +#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK /** * struct net_device - The DEVICE structure. diff --git a/net/core/dev.c b/net/core/dev.c index 722d50d..ae070de 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char *newname) BUG_ON(!dev_net(dev)); net = dev_net(dev); - if (dev->flags & IFF_UP) + if (dev->flags & IFF_UP && + !(dev->priv_flags & IFF_SLAVE_RENAME_OK)) return -EBUSY; write_seqcount_begin(&devnet_rename_seq); diff --git a/net/core/failover.c b/net/core/failover.c index 4a92a98..1fd8bbb 100644 --- a/net/core/failover.c +++ b/net/core/failover.c @@ -16,6 +16,11 @@ static LIST_HEAD(failover_list); static DEFINE_SPINLOCK(failover_lo
Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)
On 3/1/2019 5:27 AM, Michael S. Tsirkin wrote: On Thu, Feb 28, 2019 at 05:30:56PM -0800, si-wei liu wrote: On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote: On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote: Will the change break userspace further? -Siwei Didn't you show userspace is already broken. You can't "further break it", rename already fails. It's a race, userspace tends to give slave a user(space) desired name but sometimes may fail due to this race. Today if failover master is not up, rename would succeed anyway. While what you proposed prohibits user from providing a name in all circumstances if I understand you correctly. That's what I meant of breaking userspace further. On the other hand, you seem to tighten the kernel default naming to udev predictable names, which is derived from only recent systemd-udevd, while there exists many possible userspace naming schemes out of that. Users today who deliberately chooses to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to kernel provided names would expect the ethX pattern, with this change admin/user scripts which matches the ethX pattern could potentially break. Whatever crashes with a name not matching ethX will crash on the standby interface *anyway*. With udev predictable naming disabled they should not. It's not hard for user to look for device attribute to persistent the name well, in a consistent and reliable way. Well that's special code for failover already. So far we just taught userspace to skip renaming slave interfaces. I think today kernel provided names never collapse, e.g. master gets eth0 then standby will get eth1. It's the userspace specified name that suffers name clashing, mostly the default predictable naming pattern from systemd-udevd. Kernel should not assume there's only one naming pattern in userspace. Users can customize naming with udev rules in /etc which do not conform to the default udevd pattern at all. It's pretty legitimate use case. So I think what you are saying is that someone might have already written scripts and gotten them to work on v4.17 when STANDBY was included and these scripts rely on ethX. Now these scripts will break. The controversial part is the new kernel naming pattern. Initially I thought there shouldn't be such crazy scripts relying on the pattern, but when I worked on cloud-init it I realized that there's already a lot of software taking assumption around the 'eth0' name. In the past I've seen random scripts that parses the ethX name assumes (incorrectly) the name ends up with digits, or even the digits and name are 1:1 mapped. Of course, you can say these are bugs in scripts themselves. No what I say is that they will crash on rename of standby too. What do you mean crashing on standby rename? First off, if master is not up, rename on both standby and primary should not fail. If master is up, the standby should be named before userspace brings up the master, so what's the issue you talked about? Thanks, -Siwei Anyway, I'll let others in the netdev to comment on this new scheme, maybe that's the concern of merely myself. The good part of your proposal is that we can get consistent slave name, which still plays its role until we move towards making slave names less relevant, i.e. ideally a 1-netdev model. I think we both agree that the master matters more than the slave names. Maybe it is still early enough (just half a year passed) that the number of these users would be small. So how about a kernel config option and maybe a module parameter to rename the primary? People can then opt in to the old broken behaviour. Were I could I would ask why a similar opt-in (kernel config or module parameter) couldn't be implemented to open up the rename restriction on slave, net_failover in particular. What I felt about this rename restriction was more because of historical reason than anything else, while net_failover is comparatively a new type of link that we are now designing proper use case it should support, and can get it shaped to whatever it fits. My personal view is that the slave can't be renamed when master is running is just implementation details that got incorrectly exposed to userspace apps for many years. It's old behavior with historical reason for sure, but I don't think this applies to net_failover. (FWIW as one previous bond maintainer for another OS, we relieved the rename restriction slaves 13 year ago, while no single complaint or issue was ever raised because of this change over the years, neither from the customers of tens of millions of installation base, nor the FOSS software running atop. Of course, Linux is different so that experience doesn't count.) Thanks, -Siwei