Here is a patch [1] that removes pmap_kernel from userland. The idea may be a bit difficult to grasp, so I'll just draw the big picture.
Unmapping pmap_kernel from userland is a complicated business, for two main reasons:

(1) The kernel stack needs to be mapped in userland, yet it contains secret data that should not be made available to userland.

(2) In order to switch to the kernel page tables during a user->kernel transition, we need to have a TLS; but the TLS too contains secret data that should not be made available to userland.

These two issues are solved as follows:

(1) A fake, one-page-sized stack is added in pcpu_entry. The VA of this stack is dynamically kentered into the last physical page of the LWP's kernel stack. The stacks are laid out in such a way that their last page can only ever contain a trapframe structure. During a user->kernel transition, a trapframe is pushed onto the fake stack; we then switch %rsp to the real stack and continue execution as usual. We don't need to copy the content of the fake stack into the new stack, since the two VAs point to the same physical page. See this drawing [2], kindergarten style. With this design, the part of the kernel stack that contains secrets is actually *unmapped* from userland.

(2) A User Thread Local Storage (UTLS) page is added in pcpu_area. Each CPU stores there the address of its kernel pdir. A particular rsp0 is stored there too, because the syscall entry point is special and needs a different mechanism.

In this implementation everything is optimized to reduce the overhead. In the end SVS_ENTER becomes:

	movq	SVS_UTLS+UTLS_KPDIRPA,%rax
	movq	%rax,%cr3
	movq	CPUVAR(KRSP0),%rsp

which is pretty fast to execute, given the total separation it provides. The place where we kenter the fake stack into the real one is svs_lwp_switch, and it is organized in such a way that we don't even need to flush the VA from the TLB.
The only drawback is that we need to add a bunch of redundant values in cpu_info; but these values are computed and saved at boot time, so they don't need to be recomputed on each context switch or kernel<->user transition.

After this change, only the kernel image itself will still need to be unmapped, and that can be solved quickly.

Note: our handling of double faults (and, to a lesser extent, NMIs) has always been wrong, and is even more wrong with SVS; this is on my todo list.

I will probably commit this patch soon. Of course, it is compatible with KASLR.

Maxime

[1] http://m00nbsd.net/garbage/svs/stack.diff
[2] http://m00nbsd.net/garbage/svs/stack.png