On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor
> > > > > > > > > > > > > during runtime and never withdraws them. This creates a fundamental
> > > > > > > > > > > > > incompatibility with KEXEC, as these deposited pages remain
> > > > > > > > > > > > > unavailable to the new kernel loaded via KEXEC, leading to potential
> > > > > > > > > > > > > system crashes when that kernel accesses hypervisor-deposited pages.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > 
> > > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec, which is
> > > > > > > > > > > > valid and would work without any issue for L1VH.
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > No, it won't work, and the hypervisor-deposited pages won't be withdrawn.
> > > > > > > > > > 
> > > > > > > > > > All pages that were deposited in the context of a guest partition
> > > > > > > > > > (i.e. with the guest partition ID) would be withdrawn when you kill
> > > > > > > > > > the VMs, right? What other deposited pages would be left?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > > > > upon guest shutdown) and the other for the host itself (never withdrawn).
> > > > > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > > > > host partition.
> > > > > > > > 
> > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > > > > then reclaim memory? Would this help with kernel consistency
> > > > > > > > irrespective of userspace behavior?
> > > > > > > > 
> > > > > > > 
> > > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > > 
> > > > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > > > > may still crash.
> > > > > > 
> > > > > > Actually guests won't be running by the time we reach our module_exit
> > > > > > function during a kexec. Userspace processes would've been killed by then.
> > > > > > 
> > > > > 
> > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > We must not rely on the OS to do a graceful shutdown before doing kexec.
> > > > 
> > > > I see kexec -e is too brutal. Something like systemctl kexec is
> > > > more graceful and is probably used more commonly. In this case at least
> > > > we could register a reboot notifier and attempt to clean things up.
> > > > 
> > > > I think it is better to support kexec to this extent rather than
> > > > disabling it entirely.
> > > > 
> > > 
> > > You do understand that once our kernel is released to third parties, we
> > > can’t control how they will use kexec, right?
> > 
> > Yes, we can't. But that's okay. It is fine for us to say that only some
> > kexec scenarios are supported and some aren't (iff you're creating VMs
> > using MSHV; if you're not creating VMs, all of kexec is supported).
> > 
> 
> Well, I disagree here. If we say the kernel supports MSHV, we must
> provide a robust solution. A partially working solution is not
> acceptable. It makes us look careless and can damage our reputation as a
> team (and as a company).

It won't if we call out upfront what is supported and what is not.

> 
> > > 
> > > This is a valid and existing option. We have to account for it. Yet
> > > again, L1VH will be used by arbitrary third parties out there, not just
> > > by us.
> > > 
> > > We can’t say the kernel supports MSHV until we close these gaps. We must
> > 
> > We can. It is okay to say that some scenarios are supported and some aren't.
> >
> > All kexecs are supported if they never create VMs using MSHV. If they do
> > create VMs using MSHV and we implement cleanup in a reboot notifier, at
> > least systemctl kexec and crashdump kexec would work, which are probably
> > the most common uses of kexec. It's okay to say that this is all we
> > support as of now.
> > 
> 
> I'm repeating myself, but I'll try to put it differently.
> There won't be any kernel core collected if a page was deposited. You're
> arguing for a lost cause here. Once a page is allocated and deposited,
> the crash kernel will try to write it into the core.

That's why we have to implement something where we attempt to destroy
partitions and reclaim memory, and BUG() out if that fails (which
hopefully happens very rarely, if at all). This should be *the* solution
we work towards. We don't need a temporary solution that disables kexec.
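
To make this concrete, here is a rough sketch of the kind of reboot
notifier I have in mind. The mshv_shutdown_all_partitions() and
mshv_reclaim_root_pages() helpers below don't exist today; they are
placeholders for whatever the actual partition teardown and page
withdrawal paths turn out to be. The point is simply that we attempt the
cleanup and refuse to hand over to a new kernel if it fails:

/*
 * Rough sketch only. mshv_shutdown_all_partitions() and
 * mshv_reclaim_root_pages() are placeholders for the real teardown and
 * page-withdrawal paths.
 */
#include <linux/bug.h>
#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/printk.h>
#include <linux/reboot.h>

/* Placeholder prototypes; the real cleanup helpers would live in the driver. */
int mshv_shutdown_all_partitions(void);
int mshv_reclaim_root_pages(void);

static int mshv_reboot_notify(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	int ret;

	/* Tear down any guest partitions that are still around. */
	ret = mshv_shutdown_all_partitions();
	if (!ret)
		/* Withdraw the pages deposited for the host partition. */
		ret = mshv_reclaim_root_pages();

	/*
	 * If the pages can't be given back, booting a new kernel over them
	 * is unsafe; better to stop loudly here than corrupt the kexec'd
	 * kernel later.
	 */
	if (ret) {
		pr_err("mshv: failed to reclaim deposited pages (%d)\n", ret);
		BUG();
	}

	return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
	.notifier_call = mshv_reboot_notify,
};

static int __init mshv_reboot_hook_init(void)
{
	return register_reboot_notifier(&mshv_reboot_nb);
}

This on its own doesn't solve the "kexec -e" case, since userspace may
still be holding partition fds at that point and the forced teardown gets
harder, but it at least turns silent memory corruption in the new kernel
into an explicit, debuggable failure in the old one.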

> 
> > Also, what makes you think customers would even be interested in enabling
> > our module in their kernel configs if it takes away kexec?
> > 
> 
> It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> servicing the existing ones.

And what about the L2 VM state then? Those VMs might not be throwaway in
all cases.

> 
> Why do you think there won’t be customers interested in using MSHV in
> L1VH without kexec support?

Because they could already be using kexec for their servicing needs or
whatever. And no, we can't just say "don't service these VMs, just spin up
new ones".

Also, keep in mind that once L1VH is available in Azure, the distros that
run on it will be the same distros that run on all other Azure VMs. There
won't be special distros with a kernel built specifically for L1VH, and
KEXEC is generally enabled in distros. Distro vendors won't be happy about
having to publish a separate version of their image with MSHV_ROOT enabled
and KEXEC disabled, because they wouldn't want KEXEC disabled for all
Azure VMs. And customers will be confused about why the same distro
doesn't work on L1VH.

Thanks,
Anirudh.

