Ramon van Handel wrote:
> For instance, in my code, I have
>
> PUBLIC descTable(GDT, NOGDTENTRIES) {
> /* ... */
> }
>
> NOGDTENTRIES is something like 9, IIRC. This is just part of
> the binary image of the kernel, and takes up a fixed amount of
> room in between the rest of the data. This means we cannot
> overrun the GDT size !!! And we'll probably need extra
> segments ourselves for the monitor, so we may get stuck here.

A good point indeed. BTW, here is why I'm trying so hard
to eliminate the need to prescan various instructions.
If we're running a fairly clean guest OS, then at
the point where we can determine we have a prescan list
of 0 instructions, we can drop prescanning altogether for
the user code. If you can dump prescanning, then you can
also omit a lot of the other framework, like monitoring
out-of-page jumps (since all code becomes trusted),
dealing with self-modifying code via private code
pages, and so on. There is quite a bounty to be had
by dumping all that for user code, at the expense
of some more exceptions while running virtualized
guest kernel code. I'm not sure what the average ratio
is, but most of the important time on your CPU is taken
up by user code (not counting when the user is sitting
there staring at a blinking cursor).

I'm probably jumping ahead a little, but hey, you got
me started thinking about all this again. :^)

So FWIW, here's a workaround.

In your example, you have 9 entries times 8 bytes/descriptor
for a total of 72 bytes. We have to protect the whole page, though,
by marking it with supervisor privilege. I'd prefer that to marking
it not-present, so we don't have to change things while in
the monitor. Same idea either way.
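
To put numbers on that, here's a rough sketch of the arithmetic. The
function names and the 4 KB page size constant are my own for
illustration, not anything from the existing code; it only shows how
a small table still costs us whole pages of protection.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE       4096u  /* x86 4 KB page */
#define DESCRIPTOR_SIZE 8u     /* bytes per segment descriptor */

/* Bytes occupied by a GDT with the given number of entries. */
uint32_t gdt_bytes(uint32_t entries)
{
    return entries * DESCRIPTOR_SIZE;
}

/* Number of pages touched by the range [base, base + bytes).
 * Every one of them has to be protected, however little of it
 * the GDT actually uses. */
uint32_t pages_spanned(uint32_t base, uint32_t bytes)
{
    assert(bytes > 0);
    uint32_t first = base / PAGE_SIZE;
    uint32_t last  = (base + bytes - 1) / PAGE_SIZE;
    return last - first + 1;
}
```

So a 9-entry GDT normally costs one protected page, or two if it
happens to straddle a page boundary.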

So all accesses to that page from user land (the guest code)
have to be intercepted. If they're in the GDT range, we
give the guest what it wants to see. If not, we give it
the adjoining data. The actual guest page data can be kept
anywhere, in a private page or whatever. The real data as seen
at that page by ring0 code, or by the CPU for segment descriptor
accesses, will be our own GDT. It will be located at the
linear address the guest expects, but the guest will not
see our stuff, just what it wants to see, 'cause we would
be feeding it that.

If we needed a huge GDT, we could protect multiple pages
in this way. It's sort of an "overlay" technique. This
way our GDT can be of a completely different size, yet
remain located at the right linear address.
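
Here's a minimal sketch of how the intercept might sort out an access
to an overlaid page. Everything in it (struct gdt_overlay,
overlay_classify, overlay_read_byte) is a hypothetical name I made up,
and it assumes the guest's GDT fits within the one page. The private
copy holds the guest's view of the whole page; the real page holds
our own descriptors.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical overlay state for one protected page. */
struct gdt_overlay {
    uint32_t page_base;             /* linear address of the protected page */
    uint32_t gdt_start;             /* linear address of the guest's GDT    */
    uint32_t gdt_len;               /* guest GDT length in bytes            */
    uint8_t  guest_page[PAGE_SIZE]; /* private copy: the guest's view       */
};

enum overlay_hit { OVERLAY_MISS, OVERLAY_GDT, OVERLAY_ADJOINING };

/* Decide what a faulting guest access at linear address addr hit. */
enum overlay_hit overlay_classify(const struct gdt_overlay *ov, uint32_t addr)
{
    /* Unsigned subtraction wraps, so addresses below the base also fail. */
    if (addr - ov->page_base >= PAGE_SIZE)
        return OVERLAY_MISS;
    if (addr - ov->gdt_start < ov->gdt_len)
        return OVERLAY_GDT;
    return OVERLAY_ADJOINING;
}

/* Feed the guest the byte it expects, taken from the private copy. */
uint8_t overlay_read_byte(const struct gdt_overlay *ov, uint32_t addr)
{
    assert(overlay_classify(ov, addr) != OVERLAY_MISS);
    return ov->guest_page[addr - ov->page_base];
}
```

Reads come out of the private copy either way. The classification
matters mostly for writes, where a hit in the GDT range means we'd
also have to re-convert the affected descriptor into our own table.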

We should also keep in mind that loading a segment
register from a descriptor sets the Accessed bit
if it is not set already. Depending on how we implement things
(converting the guest descriptors to our own adjusted
descriptors one at a time, or a bunch at one time) so that
the natural x86 descriptor loading can take place, we
may or may not also have to protect against reads of the
GDT page. At that point we would have to make the
Accessed bits in the guest GDT entries consistent with
those in the monitor's GDT entries. We'll need to look into this
more. But the extra exceptions argument may be a moot
point.
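
For what it's worth, the consistency step could look something like
the sketch below. It assumes entry N of the guest's table maps to
entry N of ours, which may well not be how we end up doing it, and
sync_accessed_bits is a made-up name. The bit positions are the
standard ones: the access byte is byte 5 of the 8-byte descriptor,
and for code/data descriptors (S=1) bit 0 of it is Accessed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_SIZE        8u    /* bytes per descriptor                 */
#define DESC_ACCESS_BYTE 5u    /* offset of the access-rights byte     */
#define DESC_ACCESSED    0x01u /* Accessed bit (code/data descriptors) */
#define DESC_S_BIT       0x10u /* S=1: code or data descriptor         */

/* Copy Accessed bits set by the CPU in the monitor's descriptors
 * back into the guest's view. Returns how many guest entries changed.
 * Assumes a one-to-one index mapping between the two tables. */
size_t sync_accessed_bits(const uint8_t *monitor_gdt, uint8_t *guest_gdt,
                          size_t entries)
{
    size_t updated = 0;
    assert(monitor_gdt != NULL && guest_gdt != NULL);
    for (size_t i = 0; i < entries; i++) {
        uint8_t  m = monitor_gdt[i * DESC_SIZE + DESC_ACCESS_BYTE];
        uint8_t *g = &guest_gdt[i * DESC_SIZE + DESC_ACCESS_BYTE];
        /* System descriptors (S=0) have no Accessed bit; skip them. */
        if ((m & DESC_S_BIT) && (m & DESC_ACCESSED) && !(*g & DESC_ACCESSED)) {
            *g |= DESC_ACCESSED;
            updated++;
        }
    }
    return updated;
}
```

If we did this lazily, only when the guest actually reads the GDT
page, that is where protecting against reads would come in.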

One of the cool things about running Linux as a guest
is that we can see what the gains would be by modifying and
recompiling things so that the GDT is aligned in
its own page, thus not incurring the exceptions for
accesses to adjoining data. If it really made that
much difference, I don't see why the diff couldn't
make its way into future Linux versions.
> Perhaps we should run a few experiments with compiling the
> module with -fPIC and look whether the kernel module loader
> chokes or not. If that would work then that'd save us
> the effort of figuring out how to fit all those segments
> together in the right way.

I'm not very familiar with PIC code, other than knowing it's
used for dynamic libs. What I'm wondering is whether
PIC code is relocatable *once*, as the dynamic loader loads the
code into memory, and thereafter becomes static. In that case,
we are no better off unless we want to keep relocating it
with our own built-in ELF re-loader.

Normally compiled code may not be a problem anyway. You just
have to maintain the proper base-to-start-of-code offset. It's
also a private descriptor, and you don't care if the range
from base to start-of-monitor-code overlaps other memory, because
you don't access it.
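
As a sketch of what I mean by maintaining the offset (the function
names and example addresses are invented for illustration): if the
monitor code was linked to start at some offset, and we drop it at an
arbitrary linear address, we pick the descriptor base so that
base + offset still lands on the code. The arithmetic wraps modulo
2^32, which is fine.

```c
#include <assert.h>
#include <stdint.h>

/* Linear address the CPU forms for a given segment base and offset. */
uint32_t linear_of(uint32_t base, uint32_t offset)
{
    return base + offset;  /* wraps modulo 2^32, as on a 32-bit x86 */
}

/* Segment base to use so that code linked at link_offset ends up
 * addressing the place where it was actually loaded (load_linear). */
uint32_t monitor_code_base(uint32_t load_linear, uint32_t link_offset)
{
    uint32_t base = load_linear - link_offset;
    assert(linear_of(base, link_offset) == load_linear);
    return base;
}
```

For instance, code linked at 0xC0100000 but placed at linear
0x00400000 gets a base of 0x40300000. Moving the monitor just means
recomputing the base; the code itself is untouched.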
-Kevin