On Wed, Feb 04, 2026 at 05:33:29AM +0000, Anirudh Rayabharam wrote:
> On Tue, Feb 03, 2026 at 11:42:58AM -0800, Stanislav Kinsburskii wrote:
> > On Tue, Feb 03, 2026 at 04:46:03PM +0000, Anirudh Rayabharam wrote:
> > > On Tue, Feb 03, 2026 at 07:40:36AM -0800, Stanislav Kinsburskii wrote:
> > > > On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> > > > > On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > > > > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the
> > > > > > > > > > > > > > > > hypervisor during runtime and never withdraws them. This
> > > > > > > > > > > > > > > > creates a fundamental incompatibility with KEXEC, as these
> > > > > > > > > > > > > > > > deposited pages remain unavailable to the new kernel loaded
> > > > > > > > > > > > > > > > via KEXEC, leading to potential system crashes when the
> > > > > > > > > > > > > > > > kernel accesses hypervisor-deposited pages.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page
> > > > > > > > > > > > > > > > lifecycle management is implemented.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec.
> > > > > > > > > > > > > > > Which is valid and would work without any issue for L1VH.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > No, it won't work, and hypervisor-deposited pages won't be
> > > > > > > > > > > > > > withdrawn.
> > > > > > > > > > > > >
> > > > > > > > > > > > > All pages that were deposited in the context of a guest partition
> > > > > > > > > > > > > (i.e. with the guest partition ID) would be withdrawn when you
> > > > > > > > > > > > > kill the VMs, right? What other deposited pages would be left?
> > > > > > > > > > > >
> > > > > > > > > > > > The driver deposits two types of pages: one for the guests
> > > > > > > > > > > > (withdrawn upon guest shutdown) and the other for the host itself
> > > > > > > > > > > > (never withdrawn).
> > > > > > > > > > > > See hv_call_create_partition, for example: it deposits pages for
> > > > > > > > > > > > the host partition.
> > > > > > > > > > >
> > > > > > > > > > > Hmm.. I see. Is it not possible to reclaim this memory in
> > > > > > > > > > > module_exit? Also, can't we forcefully kill all running partitions
> > > > > > > > > > > in module_exit and then reclaim memory? Would this help with kernel
> > > > > > > > > > > consistency irrespective of userspace behavior?
> > > > > > > > > >
> > > > > > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > > > > > >
> > > > > > > > > > It is also not reliable. We have no hook to prevent kexec. So if we
> > > > > > > > > > fail to kill the guest or reclaim the memory for any reason, the new
> > > > > > > > > > kernel may still crash.
> > > > > > > > >
> > > > > > > > > Actually guests won't be running by the time we reach our module_exit
> > > > > > > > > function during a kexec. Userspace processes would've been killed by
> > > > > > > > > then.
> > > > > > > >
> > > > > > > > No, they will not: "kexec -e" doesn't kill user processes.
> > > > > > > > We must not rely on the OS to do a graceful shutdown before doing
> > > > > > > > kexec.
> > > > > > >
> > > > > > > I see, kexec -e is too brutal. Something like systemctl kexec is more
> > > > > > > graceful and is probably used more commonly. In this case at least we
> > > > > > > could register a reboot notifier and attempt to clean things up.
> > > > > > >
> > > > > > > I think it is better to support kexec to this extent rather than
> > > > > > > disabling it entirely.
> > > > > >
> > > > > > You do understand that once our kernel is released to third parties,
> > > > > > we can't control how they will use kexec, right?
> > > > >
> > > > > Yes, we can't. But that's okay. It is fine for us to say that only some
> > > > > kexec scenarios are supported and some aren't (iff you're creating VMs
> > > > > using MSHV; if you're not creating VMs, all of kexec is supported).
> > > >
> > > > Well, I disagree here. If we say the kernel supports MSHV, we must
> > > > provide a robust solution. A partially working solution is not
> > > > acceptable. It makes us look careless and can damage our reputation as
> > > > a team (and as a company).
> > >
> > > It won't if we call out upfront what is supported and what is not.
> > >
> > > > > > This is a valid and existing option. We have to account for it. Yet
> > > > > > again, L1VH will be used by arbitrary third parties out there, not
> > > > > > just by us.
> > > > > >
> > > > > > We can't say the kernel supports MSHV until we close these gaps. We must
> > > > >
> > > > > We can. It is okay to say some scenarios are supported and some aren't.
> > > > >
> > > > > All kexecs are supported if they never create VMs using MSHV. If they
> > > > > do create VMs using MSHV and we implement cleanup in a reboot notifier,
> > > > > at least systemctl kexec and crashdump kexec would work, which are
> > > > > probably the most common uses of kexec. It's okay to say that this is
> > > > > all we support as of now.
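
[For concreteness, the reboot-notifier cleanup being discussed above would
look roughly like the sketch below. This is illustrative only:
mshv_destroy_all_partitions() and mshv_withdraw_root_pages() are hypothetical
placeholder names, not existing driver functions.]

#include <linux/bug.h>
#include <linux/notifier.h>
#include <linux/reboot.h>

/* Hypothetical helpers, assumed to exist for the purpose of this sketch. */
int mshv_destroy_all_partitions(void);
int mshv_withdraw_root_pages(void);

static int mshv_reboot_notify(struct notifier_block *nb,
                              unsigned long action, void *data)
{
        /*
         * Tear down all partitions and pull back the pages deposited on
         * behalf of the root partition before the next kernel takes over.
         */
        if (mshv_destroy_all_partitions() || mshv_withdraw_root_pages())
                BUG();  /* stopping here beats handing bad memory to the new kernel */

        return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
        .notifier_call = mshv_reboot_notify,
};

/* Would be called from the driver's init path, e.g. module_init(). */
static int mshv_register_reboot_cleanup(void)
{
        return register_reboot_notifier(&mshv_reboot_nb);
}

[Whether this is enough for ungraceful paths such as "kexec -e" or a crash
kexec is the point of contention in the rest of the thread.]
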
> > > >
> > > > I'm repeating myself, but I'll try to put it differently.
> > > > There won't be any kernel core collected if a page was deposited. You're
> > > > arguing for a lost cause here. Once a page is allocated and deposited,
> > > > the crash kernel will try to write it into the core.
> > >
> > > That's why we have to implement something where we attempt to destroy
> > > partitions and reclaim memory (and BUG() out if that fails; which
> > > hopefully should happen very rarely if at all). This should be *the*
> > > solution we work towards. We don't need a temporary disable-kexec
> > > solution.
> >
> > No, the solution is to preserve the shared state and pass it over via KHO.
>
> Okay, then work towards it without doing a temporary KEXEC disable. We can
> call out that kexec is not supported until then. Disabling KEXEC is too
> intrusive.
>
What do you mean by "too intrusive"? The change is local to the driver's
Kconfig. There are no verbal "callouts" in upstream Linux; that's exactly
what Kconfig is used for. Once the proper solution is implemented, we can
remove the restriction.

> Is there any precedent for this? Do you know if any driver ever disabled
> KEXEC this way?
>

No, but there is no other driver like this one. Why does it matter though?

> > > > >
> > > > > Also, what makes you think customers would even be interested in
> > > > > enabling our module in their kernel configs if it takes away kexec?
> > > >
> > > > It's simple: L1VH isn't a host, so I can spin up new VMs instead of
> > > > servicing the existing ones.
> > >
> > > And what about the L2 VM state then? They might not be throwaway in all
> > > cases.
> >
> > L2 guests can (and likely will) be migrated from the old L1VH to the new
> > one. And this is most likely the current scenario customers are using.
> >
> > > > Why do you think there won't be customers interested in using MSHV in
> > > > L1VH without kexec support?
> > >
> > > Because they could already be using kexec for their servicing needs or
> > > whatever. And no, we can't just say "don't service these VMs, just spin
> > > up new ones".
> >
> > Are you speculating, or do you know for sure?
>
> It's a reasonable assumption that people are using kexec for servicing.
>

Again, using kexec for servicing is not supported: why pretend it is?

> > > Also, keep in mind that once L1VH is available in Azure, the distros
> > > that run on it would be the same distros that run on all other Azure
> > > VMs. There won't be special distros with a kernel specifically built
> > > for L1VH. And KEXEC is generally enabled in distros. Distro vendors
> > > won't be happy that they would need to publish a separate version of
> > > their image with MSHV_ROOT enabled and KEXEC disabled, because they
> > > wouldn't want KEXEC to be disabled for all Azure VMs. Also, customers
> > > will be confused about why the same distro doesn't work on L1VH.
> >
> > I don't think distro happiness is our concern. They already build custom
>
> If distros are not happy, they won't package this, and consequently nobody
> will use it.
>

Could you provide an example of such issues in the past?

> > versions for Azure. They can build another custom version for L1VH if
> > needed.
>
> We should at least check if they are ready to do this.
>

This is a labor-intensive and long-term check. Unless there is solid
evidence that they won't do it, I don't see the point in doing this.

Thanks,
Stanislav

> Thanks,
> Anirudh.
>
> > Anyway, I don't see the point in continuing this discussion. All points
> > have been made, and solutions have been proposed.
> >
> > If you can come up with something better in the next few days, so we at
> > least have a chance to get it merged in the next merge window, great. If
> > not, we should explicitly forbid the unsupported feature and move on.
> >
> > Thanks,
> > Stanislav
> >
> > > Thanks,
> > > Anirudh.
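
[For reference, the restriction under discussion amounts to a single
dependency in the driver's Kconfig entry, roughly as sketched below. The
MSHV_ROOT option name is taken from this thread; the menu text and whether
the dependency is spelled against KEXEC or KEXEC_CORE are assumptions, not a
quote of the actual patch.]

config MSHV_ROOT
        tristate "Microsoft Hypervisor root partition support"
        depends on HYPERV
        # Pages deposited to the hypervisor are never withdrawn, so a
        # kexec'd kernel could reuse or dump memory it no longer owns.
        # Keep the two features mutually exclusive until proper page
        # lifecycle management is implemented.
        depends on !KEXEC_CORE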
