On Thu, Dec 04, 2025 at 04:48:01PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <[email protected]> Sent: Tuesday,
> November 25, 2025 6:09 PM
> >
> > Introduce kref-based reference counting and spinlock protection for
> > memory regions in Hyper-V partition management. This change improves
> > memory region lifecycle management and ensures thread-safe access to the
> > region list.
> >
> > Also improves the check for overlapped memory regions during region
> > creation, preventing duplicate or conflicting mappings.
>
> This paragraph seems spurious. I think it applies to what's in Patch 5 of this
> series.
>
Indeed, this chunk escaped cleanup after refactoring.

> >
> > Previously, the regions list was protected by the partition mutex.
> > However, this approach is too heavy for frequent fault and invalidation
> > operations. Finer grained locking is now used to improve efficiency and
> > concurrency.
> >
> > This is a precursor to supporting movable memory regions. Fault and
> > invalidation handling for movable regions will require safe traversal of
> > the region list and holding a region reference while performing
> > invalidation or fault operations.
>
> The commit message discussion about the need for the refcounting and
> locking seemed a bit vague to me. It wasn't entirely clear whether these
> changes are bug fixing existing race conditions, or whether they are new
> functionality to support movable regions.
>
> In looking at the existing code, it seems that the main serialization
> mechanisms are that partition ioctls are serialized on pt_mutex, and VP
> ioctls are serialized on vp_mutex (though multiple VP ioctls can be in
> progress simultaneously against different VPs). The serialization of
> partition ioctls ensures that region manipulation is serialized, and
> that, for example, two region creations can't both verify that there's
> no overlap, but then overlap with each other. And region creation and
> deletion are serialized. In current code, the VP ioctls don't look at
> the region data structures, so there can't be any races between
> partition and VP ioctls (which are not serialized with each other). The
> only question I had about existing code is the mshv_partition_release()
> function, which proceeds without serializing against any partition
> ioctls, but maybe higher-level file system code ensures that no ioctls
> are in progress before the .release callback is made.
>
> The new requirement is movable regions, where the VP ioctl MSHV_RUN_VP
> needs to look at region data structures. You've said that in the last
> paragraph of your commit message. So I'm reading this as that the new
> locking is needed specifically because multiple MSHV_RUN_VP ioctls will
> likely be in flight simultaneously, and they are not currently
> serialized with the region operations initiated by partition ioctls.
> And then there are the "invalidate" callbacks that run on some other
> kernel thread and which also need synchronization to do region
> manipulation.
>
> Maybe I'm just looking for a little bit of a written "road map" somewhere
> that describes the intended locking scheme at a high level. :-)
>
> Michael
>

You understand this correctly.

In short, there were only two concurrent operations on regions before
movable pages were introduced: addition and removal. Both could happen
only via the partition ioctl, which is serialized by the partition
mutex, so everything was simple.

With the introduction of movable pages, regions (both the list of
regions and the region contents themselves) are accessed by partition
VP threads, which do not hold the partition mutex. While access to
region contents is protected by a per-region mutex, nothing prevents
the VMM from removing and destroying a region from underneath a VP
thread that is currently servicing a page fault or invalidation. This,
in turn, leads to a general protection fault.

This commit solves the issue by making the region a reference-counted
object, so it persists while being serviced, and by adding a spinlock
to protect traversal of the region list.
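To make the road map concrete, the pattern a VP thread is expected to
follow when servicing a fault or invalidation looks roughly like the
sketch below. This is illustrative only: mshv_partition_region_get_by_gfn
is a made-up name for a helper a later patch would add, not something
introduced by this commit.

static struct mshv_mem_region *
mshv_partition_region_get_by_gfn(struct mshv_partition *partition, u64 gfn)
{
	struct mshv_mem_region *region;

	spin_lock(&partition->pt_mem_regions_lock);
	hlist_for_each_entry(region, &partition->pt_mem_regions, hnode) {
		if (gfn < region->start_gfn ||
		    gfn >= region->start_gfn + region->nr_pages)
			continue;
		/*
		 * kref_get_unless_zero() fails if the region is already
		 * on its way out; treat that the same as "not found".
		 */
		if (!mshv_region_get(region))
			break;
		spin_unlock(&partition->pt_mem_regions_lock);
		return region;
	}
	spin_unlock(&partition->pt_mem_regions_lock);
	return NULL;
}

The caller services the fault while holding the reference and then
drops it with mshv_region_put(), so a concurrent unmap can at worst
unhook the region from the list but can never free it underneath the
VP thread.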
Thanks,
Stanislav

> >
> > Signed-off-by: Stanislav Kinsburskii <[email protected]>
> > ---
> >  drivers/hv/mshv_regions.c   | 19 ++++++++++++++++---
> >  drivers/hv/mshv_root.h      |  6 +++++-
> >  drivers/hv/mshv_root_main.c | 34 ++++++++++++++++++++++++++--------
> >  3 files changed, 47 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > index d535d2e3e811..6450a7ed8493 100644
> > --- a/drivers/hv/mshv_regions.c
> > +++ b/drivers/hv/mshv_regions.c
> > @@ -7,6 +7,7 @@
> >   * Authors: Microsoft Linux virtualization team
> >   */
> >
> > +#include <linux/kref.h>
> >  #include <linux/mm.h>
> >  #include <linux/vmalloc.h>
> >
> > @@ -154,6 +155,8 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
> >  	if (!is_mmio)
> >  		region->flags.range_pinned = true;
> >
> > +	kref_init(&region->refcount);
> > +
> >  	return region;
> >  }
> >
> > @@ -303,13 +306,13 @@ static int mshv_region_unmap(struct mshv_mem_region *region)
> >  				 mshv_region_chunk_unmap);
> >  }
> >
> > -void mshv_region_destroy(struct mshv_mem_region *region)
> > +static void mshv_region_destroy(struct kref *ref)
> >  {
> > +	struct mshv_mem_region *region =
> > +		container_of(ref, struct mshv_mem_region, refcount);
> >  	struct mshv_partition *partition = region->partition;
> >  	int ret;
> >
> > -	hlist_del(&region->hnode);
> > -
> >  	if (mshv_partition_encrypted(partition)) {
> >  		ret = mshv_region_share(region);
> >  		if (ret) {
> > @@ -326,3 +329,13 @@ void mshv_region_destroy(struct mshv_mem_region *region)
> >
> >  	vfree(region);
> >  }
> > +
> > +void mshv_region_put(struct mshv_mem_region *region)
> > +{
> > +	kref_put(&region->refcount, mshv_region_destroy);
> > +}
> > +
> > +int mshv_region_get(struct mshv_mem_region *region)
> > +{
> > +	return kref_get_unless_zero(&region->refcount);
> > +}
> > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > index ff3374f13691..4249534ba900 100644
> > --- a/drivers/hv/mshv_root.h
> > +++ b/drivers/hv/mshv_root.h
> > @@ -72,6 +72,7 @@ do { \
> >
> >  struct mshv_mem_region {
> >  	struct hlist_node hnode;
> > +	struct kref refcount;
> >  	u64 nr_pages;
> >  	u64 start_gfn;
> >  	u64 start_uaddr;
> > @@ -97,6 +98,8 @@ struct mshv_partition {
> >  	u64 pt_id;
> >  	refcount_t pt_ref_count;
> >  	struct mutex pt_mutex;
> > +
> > +	spinlock_t pt_mem_regions_lock;
> >  	struct hlist_head pt_mem_regions; // not ordered
> >
> >  	u32 pt_vp_count;
> > @@ -319,6 +322,7 @@ int mshv_region_unshare(struct mshv_mem_region *region);
> >  int mshv_region_map(struct mshv_mem_region *region);
> >  void mshv_region_invalidate(struct mshv_mem_region *region);
> >  int mshv_region_pin(struct mshv_mem_region *region);
> > -void mshv_region_destroy(struct mshv_mem_region *region);
> > +void mshv_region_put(struct mshv_mem_region *region);
> > +int mshv_region_get(struct mshv_mem_region *region);
> >
> >  #endif /* _MSHV_ROOT_H_ */
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index ae600b927f49..1ef2a28beb17 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -1086,9 +1086,13 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> >  	u64 nr_pages = HVPFN_DOWN(mem->size);
> >
> >  	/* Reject overlapping regions */
> > +	spin_lock(&partition->pt_mem_regions_lock);
> >  	if (mshv_partition_region_by_gfn(partition, mem->guest_pfn) ||
> > -	    mshv_partition_region_by_gfn(partition, mem->guest_pfn + nr_pages - 1))
> > +	    mshv_partition_region_by_gfn(partition, mem->guest_pfn + nr_pages - 1)) {
> > +		spin_unlock(&partition->pt_mem_regions_lock);
> >  		return -EEXIST;
> > +	}
> > +	spin_unlock(&partition->pt_mem_regions_lock);
> >
> >  	rg = mshv_region_create(mem->guest_pfn, nr_pages,
> >  				mem->userspace_addr, mem->flags,
> > @@ -1220,8 +1224,9 @@ mshv_map_user_memory(struct mshv_partition *partition,
> >  	if (ret)
> >  		goto errout;
> >
> > -	/* Install the new region */
> > +	spin_lock(&partition->pt_mem_regions_lock);
> >  	hlist_add_head(&region->hnode, &partition->pt_mem_regions);
> > +	spin_unlock(&partition->pt_mem_regions_lock);
> >
> >  	return 0;
> >
> > @@ -1240,17 +1245,27 @@ mshv_unmap_user_memory(struct mshv_partition *partition,
> >  	if (!(mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP)))
> >  		return -EINVAL;
> >
> > +	spin_lock(&partition->pt_mem_regions_lock);
> > +
> >  	region = mshv_partition_region_by_gfn(partition, mem.guest_pfn);
> > -	if (!region)
> > -		return -EINVAL;
> > +	if (!region) {
> > +		spin_unlock(&partition->pt_mem_regions_lock);
> > +		return -ENOENT;
> > +	}
> >
> >  	/* Paranoia check */
> >  	if (region->start_uaddr != mem.userspace_addr ||
> >  	    region->start_gfn != mem.guest_pfn ||
> > -	    region->nr_pages != HVPFN_DOWN(mem.size))
> > +	    region->nr_pages != HVPFN_DOWN(mem.size)) {
> > +		spin_unlock(&partition->pt_mem_regions_lock);
> >  		return -EINVAL;
> > +	}
> > +
> > +	hlist_del(&region->hnode);
> >
> > -	mshv_region_destroy(region);
> > +	spin_unlock(&partition->pt_mem_regions_lock);
> > +
> > +	mshv_region_put(region);
> >
> >  	return 0;
> >  }
> > @@ -1653,8 +1668,10 @@ static void destroy_partition(struct mshv_partition *partition)
> >  	remove_partition(partition);
> >
> >  	hlist_for_each_entry_safe(region, n, &partition->pt_mem_regions,
> > -				  hnode)
> > -		mshv_region_destroy(region);
> > +				  hnode) {
> > +		hlist_del(&region->hnode);
> > +		mshv_region_put(region);
> > +	}
> >
> >  	/* Withdraw and free all pages we deposited */
> >  	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
> > @@ -1852,6 +1869,7 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
> >
> >  	INIT_HLIST_HEAD(&partition->pt_devices);
> >
> > +	spin_lock_init(&partition->pt_mem_regions_lock);
> >  	INIT_HLIST_HEAD(&partition->pt_mem_regions);
> >
> >  	mshv_eventfd_init(partition);
> >
> >
>
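As a footnote to the diff above: the removal side in
mshv_unmap_user_memory() condenses to the ordering sketched below.
This is a sketch only; mshv_region_remove is a name invented here for
illustration, the actual code is inline in the ioctl handler.

static void mshv_region_remove(struct mshv_partition *partition,
			       struct mshv_mem_region *region)
{
	spin_lock(&partition->pt_mem_regions_lock);
	hlist_del(&region->hnode);	/* new lookups can no longer find it */
	spin_unlock(&partition->pt_mem_regions_lock);

	/*
	 * Drop the list's reference. If a VP thread won the lookup race
	 * and holds its own reference, the region survives until that
	 * thread's final mshv_region_put(); otherwise it is destroyed
	 * right here.
	 */
	mshv_region_put(region);
}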
