VIRTUALIZING THE GUEST OS PAGE TABLES
=====================================

To implement our virtualization framework, we need to
make heavy use of the CPU's native paging system.  We
use it to protect pages against being accessed by
the guest code, and to play other tricks.  The guest
OS page tables represent the way it expects to allocate, protect,
and map linear to physical memory when running natively
on the machine at its expected privilege levels.

Note that before we even begin execution of the guest environment,
our host kernel module allocates all of the physical memory
for the guest virtual machine.  This memory is non-paged
in the host OS to prevent conflicts (currently anyways).
As such, we can take some liberties with the paging
mechanisms, using page faults as a mechanism for the
monitor to intervene when the guest is accessing data
which we need to virtualize.

Since we are virtualizing the guest OS code, changing
privilege levels it runs at, changing the physical addresses
at which pages reside, and using the paging protection
mechanism for other tricks, we cannot just use the guest OS
page tables as-is.  We must use a modified version of
them, and be careful that these modifications are not
visible to the guest OS.

What better mechanism to virtualize the guest page tables
than the paging unit!  We can use protection flags in the
page table entries to be notified when the guest OS attempts
an access to the page tables.  At this point, we can
feed it the data it expects (read), or maintain the effect
that it intended (write).  In this way, the guest never
senses our modifications.

When there is a change of a page directory or page
table entry by the guest OS, we look at the requested
physical page address, and can map this to the corresponding
memory allocated for the guest VM by the host OS module.
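Since the host module allocates the guest's physical memory up front as one non-paged block, this guest-physical to host-physical translation can be little more than a bounds check plus an offset.  A rough sketch in C, with hypothetical names throughout:

```c
#include <stdint.h>

/* Sketch (hypothetical names): the host module allocates one
 * contiguous, non-paged block of host memory backing the entire
 * guest "physical" address space. */
typedef struct {
    uint32_t host_base;  /* host-physical base of the guest's memory */
    uint32_t mem_size;   /* size of the guest's physical memory */
} guest_mem_t;

/* Map a guest-physical address to the host-physical address backing
 * it; returns 0 if the guest address is beyond allocated memory. */
static uint32_t guest_to_host_phys(const guest_mem_t *gm, uint32_t gpa)
{
    if (gpa >= gm->mem_size)
        return 0;
    return gm->host_base + gpa;
}
```

A real implementation would do this at page granularity with a frame map rather than assume contiguity, but the principle is the same.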

Our page fault handler must examine the faulting address
and determine whether the fault is a result of our monitor
virtualizing system data structures such as the page tables,
descriptor tables, etc. (in which case the access is valid
but we need to feed it fake data), or whether the access
was truly to a non-existent or higher-privilege page
(in which case we have to reflect a page fault into
the guest OS).
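A minimal sketch of that classification step, assuming the monitor keeps a list of the linear regions it has protected for virtualization purposes (all names here are hypothetical):

```c
#include <stdint.h>

/* Sketch: how the fault handler might decide between emulating an
 * access to a virtualized structure and reflecting a genuine page
 * fault into the guest OS. */
typedef enum {
    FAULT_MONITOR_EMULATE,   /* access to page tables, descriptor
                                tables, etc.: feed fake data */
    FAULT_REFLECT_TO_GUEST   /* truly invalid access: raise the
                                fault in the guest */
} fault_class_t;

typedef struct {
    uint32_t start, end;     /* linear region protected by the monitor */
} region_t;

static fault_class_t classify_fault(uint32_t fault_addr,
                                    const region_t *prot, int nprot)
{
    for (int i = 0; i < nprot; i++)
        if (fault_addr >= prot[i].start && fault_addr < prot[i].end)
            return FAULT_MONITOR_EMULATE;
    return FAULT_REFLECT_TO_GUEST;
}
```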

One of the behavioral differences that results from us
"pushing" ring0 guest OS code down to ring3 to be run
within the VM environment, is that we no longer have
the two-level page protection (supervisor vs. user)
which the guest OS expects.  The paging unit classifies all code
in rings {0,1,2} as supervisor, which can essentially
access all pages.  Code which runs in ring3 is classified
as user code and can only access user-level pages.
But if we push all rings of guest code to ring3, then
there is no longer any access-level distinction between
guest OS code and guest user code.  Roughly depicted,
it looks something like this:

  guest   monitor
    0--+    0
    1  |    1
    2  |    2
    3--+--->3

There are a couple of fundamental approaches we can
take to solve this behavioral difference.  I describe
them briefly below.


APPROACH #1

For each set of page tables which is encountered while
running the guest OS, we could maintain two virtualized
sets of page tables.  A first set would represent the guest's
page tables biased for running the guest *user* app code within
our virtualized environment.  Since we'd be running ring3
code at ring3, this would be fairly straightforward.
A second set would represent the guest's page tables
biased for running the guest *system* code within
our virtualized environment.  Since the guest's system
code expects to be able to access all pages, we
can mark system and user pages all as user-level
privilege in this set of page tables - except where
we protect otherwise to implement various virtualization tricks.
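The biasing of the two sets can be sketched at the level of individual page table entries.  This assumes the standard 32-bit x86 PTE bit layout, and omits the remapping of physical frame addresses described earlier; function names are hypothetical:

```c
#include <stdint.h>

/* Standard 32-bit x86 page table entry bits. */
#define PTE_P  0x001u  /* present */
#define PTE_RW 0x002u  /* writable */
#define PTE_US 0x004u  /* user-accessible */

/* The user-biased set keeps the guest's U/S bit as-is, since
 * guest ring3 code runs at ring3 anyway. */
static uint32_t shadow_pte_user_biased(uint32_t guest_pte)
{
    return guest_pte;
}

/* The system-biased set marks every present page user-accessible,
 * so the guest kernel, demoted to ring3, can still reach its own
 * supervisor pages (except where we protect otherwise for tricks). */
static uint32_t shadow_pte_system_biased(uint32_t guest_pte)
{
    if (guest_pte & PTE_P)
        return guest_pte | PTE_US;
    return guest_pte;
}
```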

APPROACH #2

Rather than push guest system code down to ring3, we could push
it down to ring1 instead.  This would yield a privilege level
mapping such as the following:

  guest   monitor
    0--+    0
    1  +--->1
    2       2
    3------>3

As far as page protection goes, this would offer us an
environment where we could use the page protection
mechanisms more natively, and which would require only
one private set of page tables per guest set.  Since
x86 paging groups all CPLs of {0,1,2} as supervisor-mode,
the shift from 0 to 1 in the example above does not
affect privilege levels with respect to paging.
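To illustrate why, here is a sketch of the paging unit's privilege check as the monitor would model it (write permission and CR0.WP are ignored for brevity); note that the result is identical for CPL 0 and CPL 1:

```c
#include <stdint.h>

#define PTE_P  0x001u  /* present */
#define PTE_US 0x004u  /* user-accessible */

/* Sketch of the paging unit's privilege check: CPL 0-2 are all
 * "supervisor" to the paging hardware, CPL 3 is "user". */
static int page_access_ok(int cpl, uint32_t pte)
{
    if (!(pte & PTE_P))
        return 0;                 /* not present: always faults */
    if (cpl == 3 && !(pte & PTE_US))
        return 0;                 /* user code, supervisor page */
    return 1;                     /* CPL 0-2 reach any present page */
}
```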

The question is, which other virtualization strategies
does this interfere with?  To determine this, we should
put together a general list of the hacks we've come
up with so far, and see if we can make them work when
the guest system code is running at ring1.  And
of course, discuss the performance tradeoffs of doing
so.

Since it's related: if we want supervisor-level code to
honor write-protected pages, we have to turn on the
CR0.WP flag to get this effect.  Otherwise supervisor
code can stomp on any page regardless of its read/write
flag.  This flag will already be on for host OSes which
use an efficient fork() implementation, since it is needed
for an on-demand copy-on-write strategy.  It can be
saved/restored during the host<->monitor switch, so it
doesn't matter much - just thought I'd mention it.
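For reference, CR0.WP is bit 16.  A small sketch of biasing the flag across the host<->monitor switch (function names are illustrative):

```c
#include <stdint.h>

#define CR0_WP (1u << 16)  /* CR0.WP is bit 16 */

/* Sketch: the monitor wants WP set so that supervisor code honors
 * read-only shadow pages; the host's value is saved beforehand and
 * restored verbatim on the way back. */
static uint32_t monitor_cr0(uint32_t host_cr0)
{
    return host_cr0 | CR0_WP;
}

static uint32_t restore_host_cr0(uint32_t saved_host_cr0)
{
    return saved_host_cr0;
}
```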


MAINTAINING THE VIRTUALIZED PAGE TABLES

Regardless of the method chosen, we still need to define
how the page tables are maintained.  Within a hypothetical
guest OS, there could be N active page tables, let's say
one for each running task.  Note that there may be one
or more common regions in the linear address spaces described
by these page tables.  This would be the case, for instance,
for situations where each application has its own
mappings for a particular linear region, and the OS
code maintains a consistent set of mappings in a
different linear region.

When the guest OS attempts a switch of page tables
(generally associated with a hardware or software task
switch), the monitor will intervene and effect that
change in light of our virtualization strategies.
Again, we have some options to discuss.  In short,
we can either dump old page mappings and start anew
upon each reload of the PDBR, or try to be smart and
store multiple sets of page tables.

The simplest approach is to dump the old page table
mappings each time the guest requests a reload of
the PDBR register.  We could then mark all the entries
in the virtualized page directory as not present, so
that we could dynamically build the virtualized page
tables.  Each time a new directory entry was accessed,
the page fault would be handled by the monitor, which
could in turn mark all the page table entries not present
except for the one it needed to build.  (We of course
exempt the small region where our interrupt and exception
handlers have to exist.)  Using this technique, we could
also dump our page mappings during a privilege level
transition.  If we chose to do this, then the issues above
regarding which privilege level to push guest system code
to are moot, since we just rebuild the page tables
dynamically according to the effective CPL of the guest
code.
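A sketch of the dump step under this scheme, assuming a standard 1024-entry 32-bit page directory and a hypothetical `keep` slot covering the monitor's interrupt and exception handlers:

```c
#include <stdint.h>

#define PDE_P        0x001u  /* present */
#define PDES_PER_DIR 1024    /* entries in a 32-bit page directory */

/* Sketch: on a guest PDBR reload, clear the present bit on every
 * entry of the virtualized page directory so the tables rebuild
 * lazily via page faults.  `keep` is the one slot that must remain
 * mapped for the monitor's interrupt/exception handlers. */
static void dump_shadow_dir(uint32_t *pd, int keep)
{
    for (int i = 0; i < PDES_PER_DIR; i++)
        if (i != keep)
            pd[i] &= ~PDE_P;
}
```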

A more complex approach would be to save virtualized
page table information across PDBR reloads.  The idea
here is that when the guest OS schedules a task whose
page tables are already virtualized and stored, we can
save a number of page faults and the execution of associated
monitor code, which would otherwise be incurred by
the dynamic rebuilding of the page tables.  This technique
does generate some issues.  One is that it requires more
memory for storage of additional page tables.  It also
requires additional logic to keep track of the page tables,
and must properly maintain situations where parts of
multiple page tables are shared.
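A sketch of the extra bookkeeping this implies, with stored shadow sets keyed by the guest's PDBR (CR3) value; the cache size and all names are hypothetical, and the shared-subtable tracking mentioned above is not shown:

```c
#include <stdint.h>

#define SHADOW_CACHE_SLOTS 8  /* hypothetical cache size */

/* Sketch: each stored shadow set is keyed by the guest PDBR (CR3)
 * value it mirrors. */
typedef struct {
    uint32_t guest_pdbr;
    int      valid;
} shadow_slot_t;

/* Return the slot whose shadow tables mirror this PDBR, or -1 if
 * none exists and the set must be rebuilt dynamically. */
static int find_shadow(const shadow_slot_t *cache, uint32_t pdbr)
{
    for (int i = 0; i < SHADOW_CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].guest_pdbr == pdbr)
            return i;
    return -1;
}
```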

THE ACCESSED AND DIRTY BITS

If we maintain a private copy of page directories
and page tables, it is not enough to only monitor changes
in the guest's tables and modify a private copy accordingly.
The accessed and dirty bits (only the accessed bit in
the directory) allow the CPU to give feedback to the OS on
the use of page directory and page table entries, by updating
these fields.  We must therefore ensure that we provide
coherence between these flags as they are updated in the
private tables by the CPU, and those in the tables provided
by the guest which we are virtualizing.  We certainly can't,
or at least don't want to, have to do this on an
instruction-by-instruction basis.

Two good points to make this update
are upon the guest's access to the page tables, and at a time
when we "dump" virtualized page mappings for a previous
set of page tables.  To give the monitor a point of
intervention when the guest accesses its page tables,
we can mark these regions (where the guest actually stores
its tables, not the private ones we use) with page
protections such that the guest code will generate a
fault; a supervisor-privilege violation if we are pushing
all guest code to ring3, or a not-present fault if we are
pushing guest system code to ring1, should do.  During the
fault, our monitor will have to complete the update from
the A & D bits in our private tables to those in the
guest's tables.
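The propagation itself is just an OR of the A and D bits from the private entry into the guest's entry.  A sketch, assuming standard 32-bit x86 PTE bit positions:

```c
#include <stdint.h>

#define PTE_A 0x020u  /* accessed */
#define PTE_D 0x040u  /* dirty */

/* Sketch: the CPU sets A and D only in our private (shadow) entry;
 * at a sync point we OR those bits back into the entry the guest
 * stored in its own tables, so the guest sees the feedback it
 * expects. */
static uint32_t sync_ad_bits(uint32_t shadow_pte, uint32_t guest_pte)
{
    return guest_pte | (shadow_pte & (PTE_A | PTE_D));
}
```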
