On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> >    Hi,
> > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. 
> > > E.g.,
> > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> > > [1].
> > 
> > Why is that a problem for a TD guest kernel?  Installing exception
> > handlers is done quite early in the boot process, certainly before any
> > userspace code runs.  So I think we should never see a syscall without
> > a #VE handler being installed.  /me is confused.
> > 
> > Or do you want tell me linux has no #VE handler?
> 
> The problem is not "no #VE handler" and Linux does have #VE handler. The
> problem is Linux doesn't want any (or certain) exception occurrence in
> syscall gap, it's not specific to #VE. Frankly, I don't understand the
> reason clearly, it's something related to IST used in x86 Linux kernel.

The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first 
instruction
at the SYSCALL entry point runs with a userspaced-controlled RSP.  With TDX, a
malicious hypervisor can induce a #VE on the SYSCALL page and thus get the 
kernel
to run the #VE handler with a userspace stack.

The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on 
#VE,
but ISTs are terrible because they don't play nice with re-entrancy (among other
reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
handler triggers another #VE at any point before IRET, the second #VE will 
clobber
the stack and hose the kernel.

It's possible to workaround this, e.g. change the IST entry at the very 
beginning
of the handler, but it's a maintenance burden.  Since the only reason to use an 
IST
is to guard against a malicious hypervisor, Linux decided it would be just as 
easy
and more beneficial to avoid unexpected #VEs due to unaccepted private pages 
entirely.

Reply via email to