Dear Stephen,

Thank you again for taking the time to review my patch set. I have
incorporated your suggestions and submitted patch-set v8. My responses
to your comments are inline below.
> -----Original Message-----
> From: Stephen Hemminger <[email protected]>
> Sent: Monday, March 30, 2026 8:26 PM
> To: Bathija, Pravin <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v7 0/5] Support add/remove memory region & get-max-slots
>
> On Wed, 11 Feb 2026 10:24:28 +0000 <[email protected]> wrote:
>
> > From: Pravin M Bathija <[email protected]>
> >
> > This is version v7 of the patch set; it incorporates the
> > recommendations made by Maxime Coquelin and Feng Cheng Wen. The
> > patch set adds support for adding and removing memory regions,
> > querying the maximum number of memory slots, and other changes to
> > vhost-user messages. These messages are sent from a vhost-user
> > front-end (qemu or libblkio) to a vhost-user back-end (dpdk, spdk).
> > Support functions for these message handlers have been implemented
> > in the interest of writing optimized code, and older vhost-user
> > back-end functions have also been optimized using these newly
> > defined support functions. This implementation has been extensively
> > tested by doing read/write I/O from multiple instances of fio +
> > libblkio (front-end) talking to spdk/dpdk (back-end) based drives.
> > Tested with a qemu front-end talking to dpdk + testpmd (back-end)
> > performing add/removal of memory regions. Also tested post-copy
> > live migration after doing add_memory_region.
> >
> > Version Log:
> > Version v7 (current version): Incorporate code review suggestions
> > from Maxime Coquelin. Add debug messages to the
> > vhost_postcopy_register function.
> > Version v6: Added the enablement of this feature as a final patch
> > in this patch set, plus other code optimizations as suggested by
> > Maxime Coquelin.
> > Version v5: Removed the patch that increased the number of memory
> > regions from 8 to 128.
> > This will be submitted as a separate feature at a later point,
> > after incorporating additional optimizations. Also includes code
> > optimizations as suggested by Feng Cheng Wen.
> > Version v4: Code optimizations as suggested by Feng Cheng Wen.
> > Version v3: Code optimizations as suggested by Maxime Coquelin and
> > Thomas Monjalon.
> > Version v2: Code optimizations as suggested by Maxime Coquelin.
> > Version v1: Initial patch set.
> >
> > Pravin M Bathija (5):
> >   vhost: add user to mailmap and define to vhost hdr
> >   vhost_user: header defines for add/rem mem region
> >   vhost_user: support function defines for back-end
> >   vhost_user: Function defs for add/rem mem regions
> >   vhost_user: enable configure memory slots
> >
> >  .mailmap               |   1 +
> >  lib/vhost/rte_vhost.h  |   4 +
> >  lib/vhost/vhost_user.c | 392 ++++++++++++++++++++++++++++++++++++-----
> >  lib/vhost/vhost_user.h |  10 ++
> >  4 files changed, 365 insertions(+), 42 deletions(-)
>
> Lots more stuff found by AI code review; this is not ready.
>
> Review of [PATCH v7 1-5/5] vhost: configure memory slots support
> Author: Pravin M Bathija <[email protected]>
>
> This patch series adds support for the VHOST_USER_F_CONFIGURE_MEM_SLOTS
> protocol feature, enabling adding/removing individual memory regions
> instead of requiring a full SET_MEM_TABLE each time. The approach is
> sound and the feature is needed, but several correctness bugs need to
> be addressed.
>
> Patch 3/5 -- vhost_user: support function defines for back-end
> --------------------------------------------------------------------
>
> Error: dev_invalidate_vrings loses the *pdev update after numa_realloc
>
> translate_ring_addresses() calls numa_realloc(), which can reallocate
> the dev struct and update *pdev. In the original set_mem_table, the
> caller writes "*pdev = dev;" after the call to observe this.
>
> dev_invalidate_vrings() takes a plain "struct virtio_net *dev" and
> passes "&dev, &vq" to translate_ring_addresses -- but dev is a local
> copy. If numa_realloc reallocates, the caller's *pdev (in
> vhost_user_add_mem_reg) is never updated, and all subsequent accesses
> use a stale/freed pointer -- a use-after-free.
>
> The function should take struct virtio_net **pdev (matching the
> pattern used in set_mem_table) and propagate updates back.

Fixed. Changed the dev_invalidate_vrings signature to take struct
virtio_net **pdev. The function dereferences *pdev locally, passes
&dev to translate_ring_addresses, and writes *pdev = dev at the end to
propagate any reallocation. Both callers (add_mem_reg and rem_mem_reg)
now pass pdev and update their local dev = *pdev afterward.

> Error: async_dma_map_region can underflow reg_size
>
> The loop "reg_size -= page->size" uses unsigned subtraction. If
> page->size does not evenly divide reg->size (which can happen with
> misaligned guest pages), reg_size wraps to a huge value and the loop
> runs out of bounds. The original async_dma_map avoids this by
> iterating over nr_guest_pages directly.
>
> Suggest checking "reg_size >= page->size" before subtracting, or
> better yet iterating like the original does.

Fixed. Rewrote async_dma_map_region to iterate over all
dev->guest_pages entries and select only those whose host_user_addr
falls within [reg->host_user_addr, reg->host_user_addr + reg->size).
This eliminates the contiguous-index assumption and the reg_size
tracking entirely, so there is no subtraction that can underflow.
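As a minimal sketch of that range-filter approach, under simplified,
hypothetical types (guest_page, mem_region, dma_map_page are stand-ins
for illustration, not the actual rte_vhost definitions):

```c
#include <stdint.h>

struct guest_page {
	uint64_t host_user_addr;
	uint64_t size;
};

struct mem_region {
	uint64_t host_user_addr;
	uint64_t size;
};

/* Map only the guest pages whose host VA falls inside the region. */
static int
async_dma_map_region_sketch(const struct guest_page *pages,
		uint32_t nr_pages, const struct mem_region *reg,
		int (*dma_map_page)(const struct guest_page *))
{
	uint64_t start = reg->host_user_addr;
	uint64_t end = reg->host_user_addr + reg->size; /* exclusive */

	for (uint32_t i = 0; i < nr_pages; i++) {
		/* Filter by host VA range; there is no reg_size
		 * countdown, so no unsigned underflow is possible. */
		if (pages[i].host_user_addr < start ||
		    pages[i].host_user_addr >= end)
			continue;
		if (dma_map_page(&pages[i]) != 0)
			return -1;
	}
	return 0;
}
```

The loop bound is the page count, matching the original async_dma_map
iteration strategy the reviewer recommended.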
> Warning: free_all_mem_regions iterates all VHOST_MEMORY_MAX_NREGIONS
> slots, but the original free_mem_region iterated only
> dev->mem->nregions
>
> This is intentional for the sparse-slot design, but free_mem_region()
> checks "reg->host_user_addr" while free_all_mem_regions checks
> "reg->mmap_addr" as the sentinel. These should be consistent.
> A region that was partially initialized (mmap_addr set but
> host_user_addr not yet set, or vice versa) would behave differently
> depending on which sentinel is checked.

Fixed. Changed free_mem_region to check reg->mmap_addr instead of
reg->host_user_addr, making both functions consistent.

> Warning: vhost_user_initialize_memory unconditionally allocates
> dev->mem without freeing any prior allocation
>
> If dev->mem is non-NULL when vhost_user_initialize_memory is called,
> the old allocation is leaked. set_mem_table frees dev->mem before
> calling it, but add_mem_reg only checks "dev->mem == NULL" and skips
> the call otherwise. This is safe as coded, but the function itself
> is fragile -- consider adding an assertion or early return if
> dev->mem is already set.

Fixed. Added a guard at the top of vhost_user_initialize_memory: if
dev->mem != NULL, log an error and return -1. This makes the function
safe against accidental double-initialization.

> Info: the vhost_user_postcopy_register changes use reg_msg_index, but
> the second loop (postcopy_region_register) still iterates
> dev->mem->regions with nregions as the bound and just skips zeros.
> This is correct, but the two loops use different iteration
> strategies, which is confusing. Consider making both iterate
> VHOST_MEMORY_MAX_NREGIONS with a skip-zero check for consistency.

Fixed differently. vhost_user_add_mem_reg no longer calls
vhost_user_postcopy_register() at all. That function reads
ctx->msg.payload.memory (the SET_MEM_TABLE union member), but
ADD_MEM_REG uses the memory_single payload -- reading the wrong union
member would yield garbage for nregions.
Instead, add_mem_reg now calls
vhost_user_postcopy_region_register(dev, reg) directly for the single
new region, bypassing the message-based iteration entirely.

> Patch 4/5 -- vhost_user: Function defs for add/rem mem regions
> --------------------------------------------------------------------
>
> Error: vhost_user_add_mem_reg does not set ctx->fds[0] = -1 after
> assigning reg->fd = ctx->fds[0]
>
> In the original set_mem_table, after "reg->fd = ctx->fds[i]" the
> code sets "ctx->fds[i] = -1" to prevent the fd from being double-
> closed if the error path calls close_msg_fds(). The new add_mem_reg
> omits this. If vhost_user_mmap_region or vhost_user_postcopy_register
> fails, the error path calls close_msg_fds(ctx), which closes
> ctx->fds[0] -- but reg->fd still holds the same value. Later,
> free_all_mem_regions will close(reg->fd) again -- a double close.
>
> Fix: add "ctx->fds[0] = -1;" after "reg->fd = ctx->fds[0];".

Fixed. Added ctx->fds[0] = -1; immediately after
reg->fd = ctx->fds[0]; to prevent close_msg_fds from double-closing
the fd on error paths.

> Error: vhost_user_add_mem_reg error path frees all memory on
> single-region failure
>
> If postcopy_register fails after successfully mmap'ing one region
> (when other regions already exist), the "free_mem_table" label
> calls free_all_mem_regions + rte_free(dev->mem) +
> rte_free(dev->guest_pages), destroying ALL previously registered
> regions. This is disproportionate -- the error path should only
> unmap the single region that was just added and decrement nregions.

Fixed. Replaced the free_mem_table label with free_new_region, which
only cleans up the single failed region: it DMA-unmaps the region,
removes its guest page entries, calls free_mem_region on the single
region, and decrements nregions. Existing regions are untouched.
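The fd ownership-transfer idiom behind the ctx->fds[0] = -1 fix can be
sketched as follows; struct msg_ctx, close_msg_fds_sketch and
take_region_fd are simplified stand-ins for illustration, not the
actual vhost_user.c code:

```c
#include <unistd.h>

struct msg_ctx {
	int fds[8];
	int fd_num;
};

/* Close any fds still owned by the message context (mirrors the
 * close_msg_fds behaviour the review refers to): -1 entries are
 * skipped, so transferred fds are never touched. */
static void
close_msg_fds_sketch(struct msg_ctx *ctx)
{
	for (int i = 0; i < ctx->fd_num; i++) {
		if (ctx->fds[i] >= 0)
			close(ctx->fds[i]);
		ctx->fds[i] = -1;
	}
}

/* Transfer ownership of the received fd to the caller (the region).
 * After this, error-path cleanup via close_msg_fds_sketch() cannot
 * double-close it; only the region's own close(reg->fd) will. */
static int
take_region_fd(struct msg_ctx *ctx)
{
	int fd = ctx->fds[0];

	ctx->fds[0] = -1;
	return fd;
}
```

The key point is that exactly one owner holds a live copy of the fd at
any time, so every exit path closes it exactly once.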
> Error: overlap check uses mmap_size for the existing region but
> memory_size for the proposed region
>
> The overlap check compares:
>     current_region_guest_end = guest_user_addr + mmap_size - 1
>     proposed_region_guest_end = userspace_addr + memory_size - 1
>
> mmap_size = memory_size + mmap_offset (set by
> vhost_user_mmap_region), so the existing region's range is inflated
> by the mmap_offset, potentially producing false overlaps. Use
> current_region->size (which corresponds to memory_size) instead of
> mmap_size for a correct comparison.

Fixed. Changed the overlap check to use current_region->size (which
corresponds to memory_size from the protocol) instead of
current_region->mmap_size, in both the range calculation and the error
log message.

> Error: vhost_user_add_mem_reg uses guest_user_addr == 0 as "slot
> is empty", but guest_user_addr 0 is a valid guest virtual address
>
> The free-slot search does:
>     if (dev->mem->regions[i].guest_user_addr == 0)
>
> Guest address 0 is valid (common in some VM configurations).
> If a region is mapped at guest VA 0, it would appear as a free
> slot. The original code uses compact packing (nregions as the
> count), avoiding this problem. Consider using a dedicated
> "in_use" flag or checking mmap_addr == NULL instead (mmap
> never returns NULL on success; it returns MAP_FAILED).

Fixed. The regions array is now kept compact (contiguous from index 0
to nregions-1) via memmove compaction in rem_mem_reg. New regions are
always placed at regions[dev->mem->nregions], so no free-slot search
is needed, eliminating the guest_user_addr == 0 sentinel entirely.

> Warning: vhost_user_rem_mem_reg matches on guest_phys_addr, but the
> spec says "identified by guest address, user address and size"
>
> The comment in the code says "identified by its guest address,
> user address and size. The mmap offset is ignored." but the
> comparison also checks guest_phys_addr (which is the GPA, not
> the guest user address).
> The spec says matching is by userspace_addr (GVA), guest_phys_addr
> (GPA), and memory_size, so the match is actually correct in practice,
> but the comment is misleading -- it should mention the GPA too, or
> drop the claim about what the spec says.

Fixed. Updated the comment to: "The memory region to be removed is
identified by its GPA, user address and size."

> Warning: nregions becomes inconsistent with actual slot usage
>
> After add_mem_reg, nregions is incremented. After rem_mem_reg,
> nregions is decremented. But regions are not packed -- removing
> a region from the middle leaves a hole. Functions like
> qva_to_vva, hva_to_gpa, and rte_vhost_va_from_guest_pa all
> iterate "i < mem->nregions" and index "mem->regions[i]"
> sequentially. With sparse slots these functions will miss
> regions that are stored at indices >= nregions.
>
> Example: 3 regions at slots 0,1,2 (nregions=3). Remove slot 1
> (memset to 0, nregions=2). Now qva_to_vva iterates slots 0,1
> only -- slot 2 (still valid) is never checked. Address
> translations for memory in the third region will fail.
>
> This is a correctness bug that will cause packet processing
> failures. Either pack the array on removal (shift later entries
> down) or change all iteration to use VHOST_MEMORY_MAX_NREGIONS
> with skip-zero checks.

Fixed. rem_mem_reg now compacts the regions array after removal, using
memmove to shift later entries down, then zeroes the trailing slot.
This keeps the array contiguous from index 0 to nregions-1, so all
existing iteration using i < nregions continues to work correctly.

> Warning: dev_invalidate_vrings does not update *pdev in add_mem_reg
>
> Related to the Error in patch 3 -- even if dev_invalidate_vrings
> is fixed to take **pdev, the caller vhost_user_add_mem_reg must
> also write "*pdev = dev;" to propagate the possibly-reallocated
> pointer back through the message handler framework.

Fixed. Changed the dev_invalidate_vrings signature to take struct
virtio_net **pdev.
The function dereferences *pdev locally, passes &dev to
translate_ring_addresses, and writes *pdev = dev at the end to
propagate any reallocation. Both callers (add_mem_reg and rem_mem_reg)
now pass pdev and update their local dev = *pdev afterward.

> Patch 5/5 -- vhost_user: enable configure memory slots
> --------------------------------------------------------------------
>
> Warning: the feature is enabled unconditionally, but several
> address translation functions (qva_to_vva, hva_to_gpa,
> rte_vhost_va_from_guest_pa) have not been updated to handle
> sparse region arrays. Enabling the feature flag before fixing
> the nregions iteration issue means any front-end that uses
> ADD_MEM_REG/REM_MEM_REG will hit broken address translations.

Resolved by the compaction fix in patch 4 (comment #18). Since the
regions array is always kept contiguous, these functions work
correctly without modification. No changes are needed in patch 5.

> Summary
> --------------------------------------------------------------------
>
> The most critical issues to fix before this can be merged:
>
> 1. nregions vs sparse-slot iteration mismatch -- this will cause
>    address translation failures at runtime after any REM_MEM_REG.
>    All loops that iterate mem->regions must either use compact
>    packing or be updated to scan all VHOST_MEMORY_MAX_NREGIONS
>    slots with skip-zero checks.
>
> 2. dev_invalidate_vrings / translate_ring_addresses *pdev
>    propagation -- use-after-free if numa_realloc fires.
>
> 3. Missing ctx->fds[0] = -1 in add_mem_reg -- double close on
>    error paths.
>
> 4. Overlap check using mmap_size instead of size -- false-positive
>    overlaps.
>
> 5. Error path in add_mem_reg destroys all regions instead of just
>    the failed one.

Additional fix found during testing: VHOST_USER_REM_MEM_REG was
registered with accepts_fd = false, but the vhost-user protocol sends
an FD with REM_MEM_REG messages.
The handler was registered with accepts_fd = false, causing
validate_msg_fds to reject every REM_MEM_REG message with "expect 0
FDs, received 1". Fixed by changing the registration to
accepts_fd = true and adding close_msg_fds(ctx) calls in all exit
paths of rem_mem_reg to properly close the received FD.
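For reference, the memmove-based compaction described in the patch 4
responses can be sketched like this (region_sketch is a hypothetical
simplified type for illustration, not the actual rte_vhost_mem_region):

```c
#include <stdint.h>
#include <string.h>

#define MAX_NREGIONS 8

struct region_sketch {
	uint64_t guest_phys_addr;
	uint64_t guest_user_addr;
	uint64_t size;
};

/* Remove regions[idx] and keep the array contiguous from index 0 to
 * nregions-1, so loops bounded by "i < nregions" still visit every
 * live region. */
static void
remove_region_compact(struct region_sketch *regions, uint32_t *nregions,
		uint32_t idx)
{
	uint32_t tail = *nregions - idx - 1;

	/* Shift later entries down over the removed slot. */
	memmove(&regions[idx], &regions[idx + 1],
		tail * sizeof(regions[0]));
	(*nregions)--;
	/* Zero the now-unused trailing slot. */
	memset(&regions[*nregions], 0, sizeof(regions[0]));
}
```

Removing the last entry is the degenerate case (tail == 0, a zero-byte
memmove), so no special-casing is needed.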

