Ulrich Weigand wrote:
>
> Bochs Software Company <[EMAIL PROTECTED]> wrote:
>
> [ Sorry for the long delay in replying to this post; I've finally
> found some time now to address a few of the points you made. ]
Hey I seem to have missed the original message completely... well,
here goes.
> [snip]
> > APPROACH #2
> >
> > Rather than push guest system code down to ring3, we could push
> > it down to ring1 instead. This would yield a privilege level
> > mapping such as the following:
> >
> > guest monitor
> > 0--+ 0
> > 1 +--->1
> > 2 2
> > 3------>3
I've been walking around with this idea for quite a while, actually...
I did have something against it but I can't remember what. :(
> > As far as page protection goes, this would offer us an
> > environment, where we could use the page protection
> > mechanisms more natively,
Also meaning that the supervisor trick in the TLB hack doesn't work.
The problem is that we have a sort of inverse problem -- we want to
use the MMU as part of our virtualisation strategy, but it's generally
supervisor code that needs special attention, not user-level
code. So we'd actually like to run user code in ring 1 and
supervisor code in ring 3 :)
Okay, I know it's stupid. Just trying to make a point.
> > and which would only require
> > one private set of page tables per each guest set. Since
> > x86 paging groups all CPLs of {0,1,2} as supervisor-mode,
> > the shift from 0 to 1 in the example above does not
> > affect privilege levels with respect to paging.
>
> Eh, but how to you propose to prevent the guest OS from
> corrupting monitor code/data pages?
>
> While guest code is
> running, at least the IDT and the interrupt handlers
> must be mapped into the virtual address space. This means
> that these pages can be modified by any code running in
> supervisor mode, which would include the guest OS ...
Segmentation should help. We could use a scheme which
goes something like this:
The guest code runs in a segment that is completely flat,
except that there is a piece "off the top" of the segment
(i.e., it is not 4GB but 4GB-n*4k). This piece is used
for the monitor private data. The monitor does run in
a real flat segment, so it can access its private data
(which should be mapped there at a fixed position ---
this can be done using a linker script, which tells the
linker to relocate the data to a fixed position in memory
without actually loading it there. It's tricky but I've
done it before, so it should work (though I don't know
what implications linux modules have)). Things like the
IDT have "double mappings":
read-only mappings in the guest space and r/w mappings
in the monitor private space (note I'm assuming 486+
here, 386 doesn't honor the WP bit in supervisor mode.)
Now, the trick is that the monitor should always be
activated by a trap. Since the gate descriptor in the
IDT specifies the code segment for its handler, the CPU
will always switch to the correct segment when the
monitor starts running.
Leaving the monitor should go through IRET, which restores
the guest segment. When the monitor's running, it can
access its private data. If the guest tries to do this
though it'll incur a GPF (or something like that),
which'll trap to the monitor.
There's only one problem: the piece off the top of
the address space can't be used directly by the guest
OS, but needs to be "emulated". How terrible this is
depends on what the guest OS looks like. If it is
identity mapped then we'd have to trap on it anyway,
because the top 20MB of the physical address space
are reserved for memory-mapped I/O (such as the APICs).
I can think of a different solution to the problem,
but it would require very, very careful coding of the
guest monitor code, and would probably raise other
problems as well (hard to say anything about performance:
it may be worth it or not.) What we could do is code the
guest monitor in such a way that it doesn't actually
access any memory that the guest OS isn't allowed to
access either. For example, the guest code can't write
to the IDT (it'd be mapped read-only), so that would
mean that the guest monitor can't write to it either.
guest monitor would be restricted to directing the
virtualisation process. Whenever a protected
part of the memory needs to be accessed, it switches
back to host OS context and lets it do it (the whole
procedure can be automated by programming the pagefault
handler in such a way that it can automatically
recognise pagefaults from within the guest monitor
and forward them correctly to the host context).
The question is: how often does a protected part of
memory need to be changed? And how much do we win by
"simplifying" the guest monitor (actually, it sounds
quite complicated to do it right)?
The IDT and GDT can certainly afford the overhead,
pagetables can be more of a problem (but also not
all that critical, I think). What worries me more
is how we'd keep the guest code from corrupting the
data that we need to switch back to the host context
-- because that data cannot be read-only.
> [snip]
> > A more complex approach would be to save virtualized
> > page table information across PDBR reloads. The idea
> > here is that when the guest OS schedules a task whose
> > page tables are already virtualized and stored, we can
> > save a number of page faults and execution of associated
> > monitor code, which would otherwise be incurred from
> > the dynamic rebuilding of the page tables.
>
> Of course, you need to take into account that the page
> tables might be modified while they are *not* currently
> active ... Thus, you can't simply re-use the old monitor
> page tables if the guest reloads a PDBR value that we've
> already seen in the past.
This is a problem. Parsing the page tables on every guest
context switch doesn't sound particularly exciting, though...
Of course, we don't need a detailed analysis --- we just need
to look at whether any of the page-table pages are dirty. Only
if that is the case do we need the rest.
Ramon