On 22.07.25 05:45, Xiaoyao Li wrote:
On 6/20/2025 3:42 AM, Mathias Krause wrote:
KVM has a weird behaviour when a guest executes VMCALL on an AMD system
or VMMCALL on an Intel CPU. Both naturally generate an invalid opcode
exception (#UD) as they are just the wrong instruction for the CPU
given. But instead of forwarding the exception to the guest, KVM tries
to patch the guest instruction to match the host's actual hypercall
instruction. That is doomed to fail as read-only code is rather the
standard these days. But instead of abandoning the patching attempt and
falling back to #UD injection, KVM injects the resulting page fault.
That's wrong on multiple levels. Not only is #PF not a valid exception
for these instructions to generate, which confuses attempts to handle
it, injecting it also destroys guest state, namely the value of CR2.
Sean attempted to fix that in KVM[1] but the patch was never applied.
Later, Oliver added a quirk bit in [2] so the behaviour can, at least,
conceptually be disabled. Paolo even called for adding this very
functionality to disable the quirk in QEMU[3]. So let's just do it.
A new property 'hypercall-patching=on|off' is added, for the very
unlikely case that there are setups that really need the patching.
However, these would be vulnerable to memory corruption attacks freely
overwriting code as they please. So, my guess is, there are exactly 0
systems out there requiring this quirk.
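For reference, a minimal sketch of what disabling the quirk looks like at the KVM API level, assuming a VM file descriptor obtained via KVM_CREATE_VM. The helper name disable_hypercall_patching() is made up for illustration and is not the function the patch adds to QEMU; the #ifndef fallbacks are only there for older uapi headers and carry the values I believe upstream Linux uses.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Fallbacks for older uapi headers; values believed to match upstream Linux. */
#ifndef KVM_CAP_DISABLE_QUIRKS2
#define KVM_CAP_DISABLE_QUIRKS2 213
#endif
#ifndef KVM_X86_QUIRK_FIX_HYPERCALL_INSN
#define KVM_X86_QUIRK_FIX_HYPERCALL_INSN (1 << 5)
#endif

/*
 * Hypothetical helper: ask KVM to stop patching hypercall instructions
 * for the VM behind vm_fd.
 * Returns 0 on success, a negative errno value otherwise.
 */
static int disable_hypercall_patching(int vm_fd)
{
    struct kvm_enable_cap cap;
    int valid;

    /* KVM_CHECK_EXTENSION reports the mask of quirks that can be disabled. */
    valid = ioctl(vm_fd, KVM_CHECK_EXTENSION,
                  (unsigned long)KVM_CAP_DISABLE_QUIRKS2);
    if (valid < 0)
        return -errno;
    if (!(valid & KVM_X86_QUIRK_FIX_HYPERCALL_INSN))
        return -ENOTSUP;

    /* args[0] carries the set of quirks to disable. */
    memset(&cap, 0, sizeof(cap));
    cap.cap = KVM_CAP_DISABLE_QUIRKS2;
    cap.args[0] = KVM_X86_QUIRK_FIX_HYPERCALL_INSN;
    if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
        return -errno;

    return 0;
}
```

With 'hypercall-patching=on' the call would simply be skipped, leaving KVM's default behaviour in place.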
The default behavior has been to patch the hypercall for many years.
If you want to change the default, please at least keep it unchanged
for old machine versions, i.e., introduce a compat property that sets
KVMState->hypercall_patching_enabled to true.
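To illustrate the idea only, here is a standalone toy model of a versioned default. It does not use QEMU's actual compat-property machinery; struct compat_prop, the machine names "old-machine" and "new-machine", and hypercall_patching_default() are all hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a per-machine-version property override (hypothetical type). */
struct compat_prop {
    const char *machine;   /* machine version the override applies to */
    const char *property;  /* property being pinned */
    bool value;            /* value kept for that machine version */
};

/* Hypothetical machine name; real ones depend on the release taking the change. */
static const struct compat_prop compat_props[] = {
    { "old-machine", "hypercall-patching", true },
};

/* New default: patching off, unless an older machine version pins it on. */
static bool hypercall_patching_default(const char *machine)
{
    size_t i;

    for (i = 0; i < sizeof(compat_props) / sizeof(compat_props[0]); i++) {
        if (strcmp(compat_props[i].machine, machine) == 0 &&
            strcmp(compat_props[i].property, "hypercall-patching") == 0)
            return compat_props[i].value;
    }
    return false;
}
```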
Well, the thing is, KVM's patching is done with the effective
permissions of the guest which means, if the code in question isn't
writable from the guest's point of view, KVM's attempt to modify it will
fail. This failure isn't transparent for the guest as it sees a #PF
instead of a #UD, and that's what I'm trying to fix by disabling the quirk.
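The outcomes described above can be summarised in a small toy model. This is not KVM's code, just a sketch of what the guest observes; all names in it are made up for illustration.

```c
#include <stdbool.h>

enum guest_exception { EXC_NONE, EXC_UD, EXC_PF };

struct outcome {
    enum guest_exception exc;  /* what the guest observes */
    bool cr2_clobbered;        /* true if the guest's CR2 was overwritten */
};

/*
 * Toy model of a guest executing the "wrong" hypercall instruction.
 * quirk_enabled: the hypercall patching quirk is active.
 * text_writable: the guest's own mappings allow writing the instruction.
 */
static struct outcome wrong_hypercall_insn(bool quirk_enabled, bool text_writable)
{
    struct outcome o = { EXC_UD, false };

    if (!quirk_enabled)
        return o;              /* plain #UD, guest state untouched */

    if (text_writable) {
        o.exc = EXC_NONE;      /* instruction rewritten, hypercall proceeds */
        return o;
    }

    o.exc = EXC_PF;            /* write fails under guest permissions */
    o.cr2_clobbered = true;    /* the bogus #PF overwrites CR2 */
    return o;
}
```

Only the quirk-enabled, read-only-text case produces the bogus #PF, and that case is the norm for current guests.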
The hypercall patching was introduced in Linux commit 7aa81cc04781
("KVM: Refactor hypercall infrastructure (v3)") in v2.6.25. Until then
it was based on a dedicated hypercall page that was handled by KVM to
use the proper instruction of the KVM module in use (VMX or SVM).
Patching code was fine back then, but the introduction of DEBUG_RODATA
made the patching attempts fail and, ultimately, led Paolo to handle
this with commit c1118b3602c2 ("x86: kvm: use alternatives for VMCALL
vs. VMMCALL if kernel text is read-only").
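As a rough illustration of the guest choosing the instruction itself, the sketch below maps a CPUID vendor string to the matching encoding. Selecting by vendor string is an assumption made for brevity (the kernel keys its alternative on a CPU feature flag instead), and hypercall_insn_for() is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Three-byte encodings of the two hypercall instructions. */
static const unsigned char VMCALL_INSN[3]  = { 0x0f, 0x01, 0xc1 }; /* Intel VMX */
static const unsigned char VMMCALL_INSN[3] = { 0x0f, 0x01, 0xd9 }; /* AMD SVM */

/*
 * Pick the hypercall encoding for a CPUID vendor string.
 * Returns NULL for vendors this sketch does not know about.
 */
static const unsigned char *hypercall_insn_for(const char *vendor)
{
    if (strcmp(vendor, "GenuineIntel") == 0)
        return VMCALL_INSN;
    if (strcmp(vendor, "AuthenticAMD") == 0)
        return VMMCALL_INSN;
    return NULL;
}
```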
However, his change still doesn't account for the cross-vendor live
migration case (Intel<->AMD), which will still be broken, causing the
aforementioned bogus #PF, which will just lead to misleading Oops
reports, confusing the poor souls trying to make sense of them.
IMHO, there is no valid reason for still having the patching in place as
the .text of non-ancient kernels will be write-protected, making
patching attempts fail. And, as they fail with a #PF instead of #UD, the
guest cannot even handle them appropriately, as there was no memory
write attempt from its point of view. Therefore the default should be to
disable it, IMO. This won't prevent guests making use of the wrong
instruction from trapping, but, at least, now they'll get the correct
exception vector and can handle it appropriately.