Dear Stephen,

Thank you again for taking the time to review my patch-set. I have incorporated 
your suggestions and submitted patch-set v8.
My responses to your comments are inline.


> -----Original Message-----
> From: Stephen Hemminger <[email protected]>
> Sent: Monday, March 30, 2026 8:26 PM
> To: Bathija, Pravin <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH v7 0/5] Support add/remove memory region & get-max-
> slots
>
>
> [EXTERNAL EMAIL]
>
> On Wed, 11 Feb 2026 10:24:28 +0000
> <[email protected]> wrote:
>
> > From: Pravin M Bathija <[email protected]>
> >
> > This is version v7 of the patchset and it incorporates the
> > recommendations made by Maxime Coquelin and Feng Cheng Wen.
> > The patchset includes support for adding and removal of memory
> > regions, getting max memory slots and other changes to vhost-user
> > messages.  These messages are sent from vhost-user front-end (qemu or
> > libblkio) to a vhost-user back-end (dpdk, spdk). Support functions for
> > these message functions have been implemented in the interest of
> > writing optimized code. Older functions, part of vhost-user back-end
> > have also been optimized using these newly defined support functions.
> > This implementation has been extensively tested by doing Read/Write
> > I/O from multiple instances of fio + libblkio (front-end) talking to
> > spdk/dpdk (back-end) based drives. Tested with qemu front-end talking
> > to dpdk + testpmd (back-end) performing add/removal of memory regions.
> > Also tested post-copy live migration after doing add_memory_region.
> >
> > Version Log:
> > Version v7 (Current version): Incorporate code review suggestions from
> > Maxime Coquelin. Add debug messages to vhost_postcopy_register
> > function.
> > Version v6: Added the enablement of this feature as a final patch in
> > this patch-set and other code optimizations as suggested by Maxime
> > Coquelin.
> > Version v5: removed the patch that increased the number of memory
> > regions from 8 to 128. This will be submitted as a separate feature at
> > a later point after incorporating additional optimizations. Also
> > includes code optimizations as suggested by Feng Cheng Wen.
> > Version v4: code optimizations as suggested by Feng Cheng Wen.
> > Version v3: code optimizations as suggested by Maxime Coquelin and
> > Thomas Monjalon.
> > Version v2: code optimizations as suggested by Maxime Coquelin.
> > Version v1: Initial patch set.
> >
> > Pravin M Bathija (5):
> >   vhost: add user to mailmap and define to vhost hdr
> >   vhost_user: header defines for add/rem mem region
> >   vhost_user: support function defines for back-end
> >   vhost_user: Function defs for add/rem mem regions
> >   vhost_user: enable configure memory slots
> >
> >  .mailmap               |   1 +
> >  lib/vhost/rte_vhost.h  |   4 +
> >  lib/vhost/vhost_user.c | 392
> > ++++++++++++++++++++++++++++++++++++-----
> >  lib/vhost/vhost_user.h |  10 ++
> >  4 files changed, 365 insertions(+), 42 deletions(-)
> >
>
> Lots more stuff found by AI code review, this is not ready.
>
> Review of [PATCH v7 1-5/5] vhost: configure memory slots support
> Author: Pravin M Bathija <[email protected]>
>
> This patch series adds support for the
> VHOST_USER_F_CONFIGURE_MEM_SLOTS protocol feature, enabling
> add/remove individual memory regions instead of requiring a full
> SET_MEM_TABLE each time.  The approach is sound and the feature is needed,
> but several correctness bugs need to be addressed.
>
> Patch 3/5 -- vhost_user: support function defines for back-end
> --------------------------------------------------------------------
>
> Error: dev_invalidate_vrings loses *pdev update after numa_realloc
>
>   translate_ring_addresses() calls numa_realloc(), which can reallocate
>   the dev struct and update *pdev.  In the original set_mem_table the
>   caller writes "*pdev = dev;" after the call to observe this.
>
>   dev_invalidate_vrings() takes a plain "struct virtio_net *dev" and
>   passes "&dev, &vq" to translate_ring_addresses -- but dev is a local
>   copy.  If numa_realloc reallocates, the caller's *pdev (in
>   vhost_user_add_mem_reg) is never updated, and all subsequent accesses
>   use a stale/freed pointer -- a use-after-free.
>
>   The function should take struct virtio_net **pdev (matching the
>   pattern used in set_mem_table) and propagate updates back.
>

Fixed. Changed dev_invalidate_vrings signature to take struct virtio_net 
**pdev. The function dereferences *pdev locally,
passes &dev to translate_ring_addresses, and writes *pdev = dev at the end to 
propagate any reallocation. Both callers
(add_mem_reg and rem_mem_reg) now pass pdev and update their local dev = *pdev 
afterward.
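For reference, the pointer-propagation pattern looks roughly like this. It is a minimal sketch with simplified stand-in types and a stub in place of translate_ring_addresses()/numa_realloc(); the real DPDK structs and call chain are much larger.

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Simplified stand-in for the real struct virtio_net. */
struct virtio_net {
    int numa_node;
};

/* Stand-in for translate_ring_addresses(): may reallocate *dev
 * (as numa_realloc() can) and must update the caller's pointer. */
static void translate_ring_addresses_stub(struct virtio_net **dev)
{
    struct virtio_net *new_dev = malloc(sizeof(**dev));
    memcpy(new_dev, *dev, sizeof(**dev));
    new_dev->numa_node = 1;      /* pretend the struct moved NUMA nodes */
    free(*dev);
    *dev = new_dev;              /* propagate the reallocation */
}

/* Fixed shape of dev_invalidate_vrings(): takes **pdev so the
 * possibly-reallocated pointer reaches the caller, matching the
 * pattern used in set_mem_table. */
static void dev_invalidate_vrings(struct virtio_net **pdev)
{
    struct virtio_net *dev = *pdev;

    translate_ring_addresses_stub(&dev);

    *pdev = dev;                 /* write back for the caller */
}
```

After the call, the caller's pointer is valid even if a reallocation fired, which is exactly what the stale-pointer bug broke.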

> Error: async_dma_map_region can underflow reg_size
>
>   The loop "reg_size -= page->size" uses unsigned subtraction. If
>   page->size does not evenly divide reg->size (which can happen with
>   misaligned guest pages), reg_size wraps to a huge value and the loop
>   runs out of bounds.  The original async_dma_map avoids this by
>   iterating over nr_guest_pages directly.
>
>   Suggest checking "reg_size >= page->size" before subtracting, or
>   better yet iterating like the original does.
>

Fixed. Rewrote async_dma_map_region to iterate all dev->guest_pages entries and 
select only those whose host_user_addr falls within [reg->host_user_addr, 
reg->host_user_addr + reg->size). This eliminates both the contiguous-index 
assumption and the reg_size tracking, so there is no subtraction that can 
underflow.
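A minimal sketch of the new selection logic, assuming a simplified guest_page struct (the real one in lib/vhost carries more fields, and the real function also performs the DMA map):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified guest page entry; illustrative only. */
struct guest_page {
    uint64_t host_user_addr;
    uint64_t size;
};

/* Walk every guest page and pick only those whose host_user_addr lies
 * inside the region's [start, start + size) host VA range.  There is
 * no running reg_size counter, hence no unsigned subtraction that can
 * wrap around. */
static size_t
pages_in_region(const struct guest_page *pages, size_t nr_pages,
                uint64_t reg_start, uint64_t reg_size,
                const struct guest_page **out, size_t out_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < nr_pages; i++) {
        if (pages[i].host_user_addr >= reg_start &&
            pages[i].host_user_addr < reg_start + reg_size) {
            if (n < out_cap)
                out[n] = &pages[i];
            n++;
        }
    }
    return n;
}
```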

> Warning: free_all_mem_regions iterates all
> VHOST_MEMORY_MAX_NREGIONS
>   but the original free_mem_region iterated only dev->mem->nregions
>
>   This is intentional for the sparse-slot design, but free_mem_region()
>   checks "reg->host_user_addr" while free_all_mem_regions checks
>   "reg->mmap_addr" as the sentinel.  These should be consistent.
>   A region that was partially initialized (mmap_addr set but
>   host_user_addr not yet set, or vice versa) would behave
>   differently depending on which sentinel is checked.
>

Fixed. Changed free_mem_region to check reg->mmap_addr instead of 
reg->host_user_addr, making both functions consistent.

> Warning: vhost_user_initialize_memory unconditionally allocates
>   dev->mem without freeing any prior allocation
>
>   If dev->mem is non-NULL when vhost_user_initialize_memory is called,
>   the old allocation is leaked.  set_mem_table frees dev->mem before
>   calling it, but add_mem_reg only checks "dev->mem == NULL" and skips
>   the call otherwise.  This is safe as coded, but the function itself
>   is fragile -- consider adding an assertion or early return if
>   dev->mem is already set.
>

Fixed. Added a guard at the top of vhost_user_initialize_memory: if dev->mem != 
NULL, log an error and return -1. This makes
the function safe against accidental double-initialization.
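The guard amounts to something like this sketch (struct layouts and the function name initialize_memory are illustrative stand-ins, not the real vhost definitions):

```c
#include <stdlib.h>
#include <stddef.h>

/* Simplified stand-ins for the vhost structures. */
struct rte_vhost_memory { unsigned int nregions; };
struct virtio_net { struct rte_vhost_memory *mem; };

/* Refuse to run twice so a prior dev->mem allocation is never leaked;
 * the caller must free dev->mem before re-initializing. */
static int
initialize_memory(struct virtio_net *dev)
{
    if (dev->mem != NULL)
        return -1;   /* already initialized */

    dev->mem = calloc(1, sizeof(*dev->mem));
    return dev->mem ? 0 : -1;
}
```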

> Info: vhost_user_postcopy_register changes use reg_msg_index but the
>   second loop (postcopy_region_register) still iterates dev->mem->regions
>   with nregions as the bound and just skips zeros.  This is correct
>   but the two loops use different iteration strategies which is
>   confusing.  Consider making both iterate VHOST_MEMORY_MAX_NREGIONS
>   with a skip-zero check for consistency.

Fixed differently. vhost_user_add_mem_reg no longer calls 
vhost_user_postcopy_register() at all. That function reads 
ctx->msg.payload.memory (the SET_MEM_TABLE union member), but ADD_MEM_REG 
uses the memory_single payload, so reading the wrong union member would yield 
garbage for nregions. Instead, add_mem_reg now calls 
vhost_user_postcopy_region_register(dev, reg) directly for the single new 
region, bypassing the message-based iteration entirely.

>
>
> Patch 4/5 -- vhost_user: Function defs for add/rem mem regions
> --------------------------------------------------------------------
>
> Error: vhost_user_add_mem_reg does not set ctx->fds[0] = -1 after
>   assigning reg->fd = ctx->fds[0]
>
>   In the original set_mem_table, after "reg->fd = ctx->fds[i]" the
>   code sets "ctx->fds[i] = -1" to prevent the fd from being double-
>   closed if the error path calls close_msg_fds().  The new add_mem_reg
>   omits this.  If vhost_user_mmap_region or vhost_user_postcopy_register
>   fails, the error path calls close_msg_fds(ctx) which closes
>   ctx->fds[0] -- but reg->fd still holds the same value.  Later,
>   free_all_mem_regions will close(reg->fd) again -- double close.
>
>   Fix: add "ctx->fds[0] = -1;" after "reg->fd = ctx->fds[0];".
>

Fixed. Added ctx->fds[0] = -1; immediately after reg->fd = ctx->fds[0]; to 
prevent close_msg_fds from double-closing the fd on
error paths.
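The ownership-transfer idiom can be sketched as follows; close_fds() here is a simplified stand-in for close_msg_fds(), and take_region_fd() is a hypothetical helper showing the shape of the fix:

```c
#include <unistd.h>

/* Stand-in for close_msg_fds(): closes every still-owned descriptor.
 * Slots already handed off are marked -1 and skipped. */
static void close_fds(int *fds, int nr)
{
    for (int i = 0; i < nr; i++) {
        if (fds[i] >= 0) {
            close(fds[i]);
            fds[i] = -1;
        }
    }
}

/* Transfer ownership of the message fd to the region: once taken,
 * the message slot is cleared so an error-path sweep cannot close
 * the same descriptor a second time. */
static int take_region_fd(int *fds)
{
    int fd = fds[0];
    fds[0] = -1;
    return fd;
}
```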

> Error: vhost_user_add_mem_reg error path frees all memory on
>   single-region failure
>
>   If postcopy_register fails after successfully mmap'ing one region
>   (when other regions already exist), the "free_mem_table" label
>   calls free_all_mem_regions + rte_free(dev->mem) + rte_free
>   (dev->guest_pages), destroying ALL previously registered regions.
>   This is disproportionate -- the error should only unmap the single
>   region that was just added and decrement nregions.
>

Fixed. Replaced the free_mem_table label with free_new_region, which only 
cleans up the single failed region: DMA unmaps the
region, removes its guest page entries, calls free_mem_region on the single 
region, and decrements nregions. Existing regions
are untouched.
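The scoped rollback can be sketched like this (illustrative types and a hypothetical rollback_new_region name; the real path also DMA-unmaps, removes guest page entries, and munmaps the region):

```c
#include <string.h>
#include <stdint.h>

#define MAX_NREGIONS 8

/* Illustrative region record; the real rte_vhost_mem_region has more
 * fields (fd, mmap_addr, ...). */
struct mem_region { uint64_t guest_phys_addr; uint64_t size; };

struct mem_table {
    unsigned int nregions;
    struct mem_region regions[MAX_NREGIONS];
};

/* Roll back only the region that was just added (always the last
 * slot); earlier regions are left untouched. */
static void
rollback_new_region(struct mem_table *mem)
{
    unsigned int idx = mem->nregions - 1;

    memset(&mem->regions[idx], 0, sizeof(mem->regions[idx]));
    mem->nregions--;
}
```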

> Error: overlap check uses mmap_size for existing region but
>   memory_size for proposed region
>
>   The overlap check compares:
>     current_region_guest_end = guest_user_addr + mmap_size - 1
>     proposed_region_guest_end = userspace_addr + memory_size - 1
>
>   mmap_size = memory_size + mmap_offset (set by
> vhost_user_mmap_region).
>   So the existing region's range is inflated by the mmap_offset,
>   potentially producing false overlaps.  Use current_region->size
>   (which corresponds to memory_size) instead of mmap_size for a
>   correct comparison.
>

Fixed. Changed the overlap check to use current_region->size (which corresponds 
to memory_size from the protocol) instead of
current_region->mmap_size, in both the range calculation and the error log 
message.
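The corrected comparison reduces to a plain closed-interval overlap test on memory_size-based ranges, sketched here with hypothetical parameter names:

```c
#include <stdint.h>
#include <stdbool.h>

/* Both interval ends are built from the protocol's memory_size
 * (region->size), never from mmap_size, which is inflated by the
 * mmap offset and would produce false overlaps. */
static bool
regions_overlap(uint64_t existing_start, uint64_t existing_size,
                uint64_t proposed_start, uint64_t proposed_size)
{
    uint64_t existing_end = existing_start + existing_size - 1;
    uint64_t proposed_end = proposed_start + proposed_size - 1;

    return proposed_start <= existing_end &&
           existing_start <= proposed_end;
}
```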

> Error: vhost_user_add_mem_reg uses guest_user_addr == 0 as "slot
>   is empty" but guest_user_addr 0 is a valid guest virtual address
>
>   The free-slot search does:
>     if (dev->mem->regions[i].guest_user_addr == 0)
>
>   Guest address 0 is valid (common in some VM configurations).
>   If a region is mapped at guest VA 0, it would appear as a free
>   slot.  The original code uses compact packing (nregions as the
>   count), avoiding this problem.  Consider using a dedicated
>   "in_use" flag or checking mmap_addr == NULL instead (mmap
>   never returns NULL on success, it returns MAP_FAILED).
>

Fixed. The regions array is now kept compact (contiguous from index 0 to 
nregions-1) via memmove compaction in rem_mem_reg.
New regions are always placed at regions[dev->mem->nregions] — no free-slot 
search is needed, eliminating the guest_user_addr
== 0 sentinel entirely.

> Warning: vhost_user_rem_mem_reg matches on guest_phys_addr but the
>   spec says "identified by guest address, user address and size"
>
>   The comment in the code says "identified by its guest address,
>   user address and size. The mmap offset is ignored." but the
>   comparison also checks guest_phys_addr (which is the GPA, not
>   the guest user address).  The spec says matching is by
>   userspace_addr (GVA), guest_phys_addr (GPA), and memory_size.
>   So the match is actually correct in practice, but the comment
>   is misleading -- it should mention GPA too, or remove the
>   comment about what the spec says.
>

Fixed. Updated the comment to: "The memory region to be removed is identified 
by its GPA, user address and size."

> Warning: nregions becomes inconsistent with actual slot usage
>
>   After add_mem_reg, nregions is incremented.  After rem_mem_reg,
>   nregions is decremented.  But regions are not packed -- removing
>   a region from the middle leaves a hole.  Functions like
>   qva_to_vva, hva_to_gpa, and rte_vhost_va_from_guest_pa all
>   iterate "i < mem->nregions" and index "mem->regions[i]"
>   sequentially.  With sparse slots these functions will miss
>   regions that are stored at indices >= nregions.
>
>   Example: 3 regions at slots 0,1,2 (nregions=3).  Remove slot 1
>   (memset to 0, nregions=2).  Now qva_to_vva iterates slots 0,1
>   only -- slot 2 (still valid) is never checked.  Address
>   translations for memory in the third region will fail.
>
>   This is a correctness bug that will cause packet processing
>   failures.  Either pack the array on removal (shift later entries
>   down) or change all iteration to use VHOST_MEMORY_MAX_NREGIONS
>   with skip-zero checks.
>

Fixed. rem_mem_reg now compacts the regions array after removal, using memmove 
to shift later entries down and then zeroing the trailing slot. This keeps the 
array contiguous from index 0 to nregions-1, so all existing iteration using 
i < nregions continues to work correctly.
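The compaction step can be sketched as follows (illustrative region struct and a hypothetical remove_region helper; the real record carries fd, mmap_addr, and so on):

```c
#include <string.h>
#include <stdint.h>

#define MAX_NREGIONS 8

/* Illustrative region record. */
struct mem_region { uint64_t guest_phys_addr; uint64_t size; };

/* Shift the tail of the array down over the removed slot, zero the
 * now-unused trailing slot, and decrement nregions.  Loops that use
 * "i < nregions" then stay correct with no sparse-slot handling. */
static void
remove_region(struct mem_region *regions, unsigned int *nregions,
              unsigned int idx)
{
    unsigned int tail = *nregions - idx - 1;

    memmove(&regions[idx], &regions[idx + 1],
            tail * sizeof(regions[0]));
    memset(&regions[*nregions - 1], 0, sizeof(regions[0]));
    (*nregions)--;
}
```

memmove (rather than memcpy) is required because the source and destination ranges overlap.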

> Warning: dev_invalidate_vrings does not update *pdev in add_mem_reg
>
>   Related to the Error in patch 3 -- even if dev_invalidate_vrings
>   is fixed to take **pdev, the caller vhost_user_add_mem_reg must
>   also write "*pdev = dev;" to propagate the possibly-reallocated
>   pointer back through the message handler framework.
>

Fixed, with the same change described in my response to the related patch 3 
error above: dev_invalidate_vrings now takes struct virtio_net **pdev and 
writes *pdev = dev at the end, and both callers (add_mem_reg and rem_mem_reg) 
update their local dev = *pdev afterward, so the possibly-reallocated pointer 
propagates back through the message handler framework.

>
> Patch 5/5 -- vhost_user: enable configure memory slots
> --------------------------------------------------------------------
>
> Warning: the feature is enabled unconditionally but several
>   address translation functions (qva_to_vva, hva_to_gpa,
>   rte_vhost_va_from_guest_pa) have not been updated to handle
>   sparse region arrays.  Enabling the feature flag before fixing
>   the nregions iteration issue means any front-end that uses
>   ADD_MEM_REG/REM_MEM_REG will hit broken address translations.
>

Resolved by the compaction fix in patch 4. Since the regions array is always 
kept contiguous, these functions work correctly without modification. No 
changes needed in patch 5.

>
> Summary
> --------------------------------------------------------------------
>
> The most critical issues to fix before this can be merged:
>
> 1. nregions vs sparse-slot iteration mismatch -- this will cause
>    address translation failures at runtime after any REM_MEM_REG.
>    All loops that iterate mem->regions must either use compact
>    packing or be updated to scan all VHOST_MEMORY_MAX_NREGIONS
>    slots with skip-zero checks.
>
> 2. dev_invalidate_vrings / translate_ring_addresses *pdev propagation
>    -- use-after-free if numa_realloc fires.
>
> 3. Missing ctx->fds[0] = -1 in add_mem_reg -- double-close on
>    error paths.
>
> 4. Overlap check using mmap_size instead of size -- false positive
>    overlaps.
>
> 5. Error path in add_mem_reg destroys all regions instead of just
>    the failed one.

Additional fix found during testing:
VHOST_USER_REM_MEM_REG was registered with accepts_fd = false.

The vhost-user protocol sends an FD with REM_MEM_REG messages. The handler was 
registered with accepts_fd = false, causing
validate_msg_fds to reject every REM_MEM_REG message with "expect 0 FDs, 
received 1". Fixed by changing the registration to
accepts_fd = true and adding close_msg_fds(ctx) calls in all exit paths of 
rem_mem_reg to properly close the received FD.
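The registration change can be illustrated with this simplified sketch of the accepts_fd idea (names, table layout, and validate_fds are hypothetical, not the actual DPDK handler table):

```c
#include <stdbool.h>

/* Each message type declares whether it may carry file descriptors;
 * a validate_msg_fds-style check rejects mismatches before the
 * handler runs. */
struct msg_handler {
    const char *name;
    bool accepts_fd;
};

/* Handlers that do not accept fds must see exactly zero fds. */
static bool
validate_fds(const struct msg_handler *h, int nr_fds)
{
    if (!h->accepts_fd)
        return nr_fds == 0;
    return true;
}
```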

