On Tue, Feb 03, 2026 at 10:34:28AM +0530, Anirudh Rayabharam wrote:
> On Mon, Feb 02, 2026 at 11:18:27AM -0800, Stanislav Kinsburskii wrote:
> > On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> > > On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during runtime and never withdraws them. This creates a fundamental incompatibility with KEXEC, as these deposited pages remain unavailable to the new kernel loaded via KEXEC, leading to potential system crashes when that kernel accesses hypervisor-deposited pages.
> > > > > > > > > > > >
> > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle management is implemented.
> > > > > > > > > > >
> > > > > > > > > > > Someone might want to stop all guest VMs and do a kexec, which is valid and would work without any issue for L1VH.
> > > > > > > > > >
> > > > > > > > > > No, it won't work, and hypervisor-deposited pages won't be withdrawn.
> > > > > > > > >
> > > > > > > > > All pages that were deposited in the context of a guest partition (i.e., with the guest partition ID) would be withdrawn when you kill the VMs, right? What other deposited pages would be left?
> > > > > > > >
> > > > > > > > The driver deposits two types of pages: those for the guests (withdrawn upon guest shutdown) and those for the host itself (never withdrawn). See hv_call_create_partition, for example: it deposits pages for the host partition.
> > > > > > >
> > > > > > > Hmm... I see. Is it not possible to reclaim this memory in module_exit? Also, can't we forcefully kill all running partitions in module_exit and then reclaim memory? Would this help with kernel consistency irrespective of userspace behavior?
> > > > > >
> > > > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > > >
> > > > > > It is also not reliable. We have no hook to prevent kexec. So if we fail to kill the guest or reclaim the memory for any reason, the new kernel may still crash.
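
For reference, the mutual exclusion proposed in the patch above is a build-time one. A minimal sketch of what such a Kconfig dependency could look like follows; the symbol names, the dependency expression (KEXEC vs. KEXEC_CORE) and the help text are illustrative assumptions, not the actual drivers/hv Kconfig entry:

# Illustrative sketch only; real symbol names and dependencies may differ.
config MSHV_ROOT
	tristate "Microsoft Hypervisor root partition interfaces"
	depends on HYPERV
	# Pages deposited to the hypervisor are never withdrawn, so a
	# kexec'd kernel could crash when it reuses them.
	depends on !KEXEC_CORE
	help
	  Interfaces for creating and running guest partitions from the
	  root (or L1VH) partition. Mutually exclusive with kexec until
	  proper page lifecycle management is implemented.
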
> > > > >
> > > > > Actually, guests won't be running by the time we reach our module_exit function during a kexec. Userspace processes would've been killed by then.
> > > >
> > > > No, they will not: "kexec -e" doesn't kill user processes. We must not rely on the OS to do a graceful shutdown before doing a kexec.
> > >
> > > I see, kexec -e is too brutal. Something like systemctl kexec is more graceful and is probably used more commonly. In this case at least we could register a reboot notifier and attempt to clean things up.
> > >
> > > I think it is better to support kexec to this extent rather than disabling it entirely.
> >
> > You do understand that once our kernel is released to third parties, we can’t control how they will use kexec, right?
>
> Yes, we can't. But that's okay. It is fine for us to say that only some kexec scenarios are supported and some aren't (and that only matters if you're creating VMs using MSHV; if you're not creating VMs, all of kexec is supported).
>
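
For concreteness, the reboot-notifier approach being suggested here would presumably look something like the sketch below. The two teardown helpers are hypothetical placeholders; nothing like them exists in the driver today:

#include <linux/notifier.h>
#include <linux/reboot.h>

/* Hypothetical teardown helpers, not existing driver functions. */
void mshv_kill_all_guest_partitions(void);
int mshv_withdraw_host_deposited_pages(void);

static int mshv_reboot_notify(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	/*
	 * Best-effort cleanup on a graceful reboot/kexec: tear down any
	 * remaining guest partitions, then try to withdraw the pages that
	 * were deposited on behalf of the host partition.
	 */
	mshv_kill_all_guest_partitions();
	mshv_withdraw_host_deposited_pages();
	return NOTIFY_DONE;
}

static struct notifier_block mshv_reboot_nb = {
	.notifier_call = mshv_reboot_notify,
};

/* Called from the driver's init path, e.g. module_init(). */
static int mshv_register_reboot_hook(void)
{
	return register_reboot_notifier(&mshv_reboot_nb);
}

Whether such a notifier runs on every kexec path, and whether its cleanup can be trusted to succeed, is exactly the reliability question being argued in this thread.
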
Well, I disagree here. If we say the kernel supports MSHV, we must provide a robust solution. A partially working solution is not acceptable. It makes us look careless and can damage our reputation as a team (and as a company).

> > This is a valid and existing option. We have to account for it. Yet again, L1VH will be used by arbitrary third parties out there, not just by us.
> >
> > We can’t say the kernel supports MSHV until we close these gaps. We must
>
> We can. It is okay to say that some scenarios are supported and some aren't.
>
> All kexecs are supported if they never create VMs using MSHV. If they do create VMs using MSHV and we implement cleanup in a reboot notifier, at least systemctl kexec and crash-dump kexec would work, which are probably the most common uses of kexec. It's okay to say that this is all we support as of now.
>

I'm repeating myself, but I'll try to put it differently. There won't be any kernel core collected if a page was deposited. You're arguing for a lost cause here. Once a page is allocated and deposited, the crash kernel will try to write it into the core, and that is exactly the access that can crash it.

> Also, what makes you think customers would even be interested in enabling our module in their kernel configs if it takes away kexec?
>

It's simple: L1VH isn't a host, so I can spin up new VMs instead of servicing the existing ones. Why do you think there won’t be customers interested in using MSHV in L1VH without kexec support?

Thanks,
Stanislav

> Thanks,
> Anirudh.
>
> > not depend on user space to keep the kernel safe.
> >
> > Do you agree?
> >
> > Thanks,
> > Stanislav
> >
> > > > > > > Also, why is this sloppy? Isn't this what module_exit should be doing anyway? If someone unloads our module, we should be trying to clean everything up (including killing guests) and reclaim memory.
> > > > > >
> > > > > > Kexec does not unload modules, but it doesn't really matter even if it did. There are other means to plug into the reboot flow, but neither of them is robust or reliable.
> > > > > >
> > > > > > > In any case, we can BUG() out if we fail to reclaim the memory. That would stop the kexec.
> > > > > >
> > > > > > By killing the whole system? This is not a good user experience, and I don't see how it can be justified.
> > > > >
> > > > > It is justified because, as you said, once we reach that failure we can no longer guarantee integrity. So BUG() makes sense. This BUG() would cause the system to go for a full reboot and restore integrity.
> > > > >
> > > > > > > This is a better solution than disabling KEXEC outright: our driver would have made its best possible effort to make kexec work.
> > > > > >
> > > > > > How is an unreliable feature leading to potential system crashes better than disabling kexec outright?
> > > > >
> > > > > Because there are ways of using the feature reliably. What if someone has MSHV_ROOT enabled but never starts a VM? (Just because someone has our driver enabled in the kernel doesn't mean they're using it.) What about crash dump?
> > > > >
> > > > > It is far better to support some of these scenarios and be unreliable in some corner cases rather than disabling the feature completely.
> > > > >
> > > > > Also, I'm curious if any other driver in the kernel has ever done this (force-disable KEXEC).
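
To make the BUG()-on-failed-reclaim idea discussed above concrete, here is a rough sketch. The withdraw helper is a hypothetical placeholder; the point is only that a failed reclaim leaves pages the hypervisor still owns, so continuing into a kexec is not safe:

#include <linux/bug.h>
#include <linux/printk.h>

/* Hypothetical: withdraws pages deposited for the host partition, 0 on success. */
int mshv_withdraw_host_deposited_pages(void);

static void mshv_reclaim_or_bug(void)
{
	int ret = mshv_withdraw_host_deposited_pages();

	if (ret) {
		/*
		 * The hypervisor still owns some deposited pages. A kexec'd
		 * kernel could crash when it reuses them, so stop here rather
		 * than continue with unknown integrity.
		 */
		pr_emerg("mshv: failed to withdraw deposited pages: %d\n", ret);
		BUG();
	}
}
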
> > > >
> > > > It's the complete opposite story for me: the latter provides limited but robust functionality, while the former provides unreliable and unpredictable behavior.
> > > >
> > > > > > There are two long-term solutions:
> > > > > > 1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> > > > >
> > > > > I honestly think we should focus efforts on making kexec work rather than finding ways to prevent it.
> > > >
> > > > There is no argument about it. But until we have it fixed properly, we have two options: either disable kexec or stop claiming we have our driver up and ready for external customers. Given the importance of this driver for current projects, I believe the better way would be to explicitly limit the functionality instead of postponing the productization of the driver.
> > >
> > > It is okay to claim our driver is ready even if it doesn't support all kexec cases. If we can support the common cases such as crash dump and maybe kexec-based servicing (pretty sure people use systemctl kexec rather than kexec -e for this, with proper teardown), we can claim that our driver is ready for general use.
> > >
> > > Thanks,
> > > Anirudh.
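
On the first long-term solution quoted above (preventing kexec while state is shared with the hypervisor): no such hook exists today, as noted earlier in the thread, so everything below is hypothetical. The driver side could be as simple as a counter of shared state that a future "may I kexec?" hook would consult:

#include <linux/atomic.h>
#include <linux/errno.h>

/* Count of resources (e.g. deposited pages) currently shared with the hypervisor. */
static atomic64_t mshv_shared_state_count = ATOMIC64_INIT(0);

/* Called wherever the driver deposits or withdraws pages. */
static void mshv_note_deposit(long nr_pages)
{
	atomic64_add(nr_pages, &mshv_shared_state_count);
}

static void mshv_note_withdraw(long nr_pages)
{
	atomic64_sub(nr_pages, &mshv_shared_state_count);
}

/*
 * Hypothetical callback a "prevent kexec" hook would invoke before kexec:
 * refuse while any state is still shared with the hypervisor.
 */
static int mshv_kexec_allowed(void)
{
	return atomic64_read(&mshv_shared_state_count) ? -EBUSY : 0;
}
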
