Kevin Lawton wrote:
> And dealing with self modifying code with private code
> pages, and ... There is quite a bounty to be had
> by dumping all that for user code, at the expense
> of some more exceptions while running virtualized
> guest kernel code. I'm not sure what the average ratio
> is, but most of the important time on your CPU is taken
> up by user code. (not counting when the user is sitting
> there staring at a blinking cursor)
The problem is that if you keep the GDT where the
guest code expects it to be, then 9 times out of 10
it'll be embedded in the data section of the guest
OS. The IDT is fairly big (256 gates of 8 bytes each,
i.e. 2 KB, which is still only half a page), but the
GDT is usually really small. So you end up having to
protect a page that is at most half sensitive
descriptor data and at least half potentially very
important, frequently accessed data. How much that
hurts performance depends on the OS code, or even on
the compiler.
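To put numbers on "half a page": here's a small sketch (assuming 4 KB pages, no 4 MB pages or PAE) that computes how many bytes on the pages spanned by a descriptor table do *not* belong to the table. Every guest access to those bytes traps once the pages are protected. `collateral_bytes` is just a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumes plain 4 KB pages */

/* Bytes on the pages covering [base, base + size) that are NOT part of
 * the table itself. Protecting the table's pages makes guest accesses
 * to these "collateral" bytes fault too. */
static uint32_t collateral_bytes(uint32_t base, uint32_t size)
{
    assert(size > 0);
    uint32_t first_page = base & ~(PAGE_SIZE - 1);
    uint32_t last_page  = (base + size - 1) & ~(PAGE_SIZE - 1);
    uint32_t protected_bytes = last_page - first_page + PAGE_SIZE;
    return protected_bytes - size;
}
```

A page-aligned 256-entry IDT leaves 2048 collateral bytes; a 9-entry GDT leaves 4024; and a small table straddling a page boundary is the worst case, dragging in almost two full pages.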
A quick look at my Linux kernel revealed that on
my system the GDT shares its page with a portion
of the kernel symbol table. I don't think that's
all that bad because (I'm not sure of this!) I
believe the symbol table is only used when loading
new modules, which is not performance critical.
The IDT shares its page with the start of the BSS:
c0232000 D idt_table
c0232800 A __bss_start
c0232800 b undone_count
c0232804 b handler_func
c0232808 b handler_info
c023280c b delay_at_last_interrupt
c0232810 b last_tsc_low
c0232814 b pci_bios_present
c0232820 b n.460
c0232840 b calibration_lock.548
c0232844 b trampoline_base
c0232848 b cached_APIC_ICR
c023284c b cached_APIC_ICR2
c0232850 b calibration_result
c0232860 b irq_2_pin
I don't recognise most of these, but they don't
look extremely critical. last_tsc_low and
delay_at_last_interrupt are accessed on every timer
interrupt, and undone_count and handler_* are used
for MTRRs on the P6. The rest appear to be used
mostly at initialisation time. But this is just
Linux; can we really count on something like this?
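To double-check which variables really sit on the IDT page, this sketch takes the addresses from the nm listing above and counts the storage symbols whose page number (address >> 12, assuming 4 KB pages) matches idt_table's. Absolute symbols such as __bss_start are skipped because they don't occupy storage. `count_page_sharers` is a hypothetical helper. Note that idt_table at c0232000 runs right up to __bss_start at c0232800, i.e. exactly 2 KB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12   /* assumes 4 KB pages */

struct sym {
    uint32_t addr;
    char type;          /* nm type letter */
    const char *name;
};

/* Addresses copied from the nm listing of the IDT page. */
static const struct sym syms[] = {
    { 0xc0232000u, 'D', "idt_table" },
    { 0xc0232800u, 'A', "__bss_start" },
    { 0xc0232800u, 'b', "undone_count" },
    { 0xc0232804u, 'b', "handler_func" },
    { 0xc0232808u, 'b', "handler_info" },
    { 0xc023280cu, 'b', "delay_at_last_interrupt" },
    { 0xc0232810u, 'b', "last_tsc_low" },
    { 0xc0232814u, 'b', "pci_bios_present" },
    { 0xc0232820u, 'b', "n.460" },
    { 0xc0232840u, 'b', "calibration_lock.548" },
    { 0xc0232844u, 'b', "trampoline_base" },
    { 0xc0232848u, 'b', "cached_APIC_ICR" },
    { 0xc023284cu, 'b', "cached_APIC_ICR2" },
    { 0xc0232850u, 'b', "calibration_result" },
    { 0xc0232860u, 'b', "irq_2_pin" },
};

/* Count storage symbols (other than `target`) on the same page as
 * `target`. Returns 0 if `target` is not in the table. */
static size_t count_page_sharers(const struct sym *tab, size_t n,
                                 const char *target)
{
    assert(tab != NULL || n == 0);
    const struct sym *t = NULL;
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, target) == 0)
            t = &tab[i];
    if (t == NULL)
        return 0;

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (&tab[i] == t || tab[i].type == 'A')
            continue;   /* skip the target itself and absolute markers */
        if ((tab[i].addr >> PAGE_SHIFT) == (t->addr >> PAGE_SHIFT))
            count++;
    }
    return count;
}
```

For this listing all 13 BSS variables share the page with idt_table.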
> I'm probably jumping ahead a little, but hey you got
> me started thinking about all this again. :^)
>
> So FWIW, here's a workaround.
>
> In your example, you have 9 entries times 8 bytes/descriptor
> for a sum of 72 bytes. But we have to protect the whole page,
> marking it with supervisor privilege. I'd prefer using that rather
> than not-present so we don't have to change things while in
> the monitor. Same idea.
Right, okay.
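In page-table terms, "supervisor" versus "not-present" means clearing the U/S bit (bit 2) rather than the P bit (bit 0) of the PTE. With U/S clear the page stays mapped, so the CPU's own implicit supervisor-mode accesses (e.g. fetching a descriptor from the GDT on a segment load) still work, while any CPL 3 access raises #PF with the "protection violation" and "user mode" bits set in the error code. A sketch assuming 32-bit non-PAE paging; the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit x86 (non-PAE) page-table-entry flag bits. */
#define PTE_P  0x001u   /* present */
#define PTE_RW 0x002u   /* writable */
#define PTE_US 0x004u   /* 1 = user accessible, 0 = supervisor only */

/* Page-fault error code bits pushed by the CPU. */
#define PF_PROT  0x1u   /* 1 = protection violation, 0 = not present */
#define PF_WRITE 0x2u   /* 1 = the access was a write */
#define PF_USER  0x4u   /* 1 = the access came from CPL 3 */

/* Keep the page present (so CPU descriptor fetches still succeed) but
 * make it supervisor-only, so guest code running at CPL 3 faults. */
static uint32_t make_supervisor_only(uint32_t pte)
{
    assert(pte & PTE_P);
    return pte & ~PTE_US;
}

/* True for the kind of fault we expect from a guest touching the page:
 * a protection violation on a present page, taken in user mode. */
static int is_user_protection_fault(uint32_t err)
{
    return (err & (PF_PROT | PF_USER)) == (PF_PROT | PF_USER);
}
```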
> So all accesses to that page from user land (the guest code)
> have to be intercepted. If they're in the GDT range, we
> give the guest what it wants to see. If not, we give it
> the adjoining data. The actual guest page data can be kept
> anywhere in a private page or whatever. The real data as seen
> at the page by ring0 code or by the CPU for segment descriptor
> accesses will be our own LDT. It will be located at the
> linear address the guest expects, but the guest will not
> see our stuff, just what it wants to see 'cause we would
> be feeding it that.
Okay, nice trick. But it does involve a lot of trapping,
which we should also try to avoid (it's slow). See above.
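For reference, Kevin's trick boils down to something like the sketch below: one private copy of the page holds everything the guest believes is there (its own GDT plus the adjoining data), while the real page holds the monitor's descriptors. Trapped guest reads are served from the private copy; trapped writes update it and, if they touch the guest's GDT range, flag that the monitor must re-derive its real descriptors. `struct shadow_page`, `emulate_read` and `emulate_write` are hypothetical names, and descriptor validation is left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Hypothetical per-page state kept by the monitor. */
struct shadow_page {
    uint32_t linear_base;           /* page-aligned guest linear address */
    uint32_t gdt_base, gdt_limit;   /* guest GDT range, limit as in GDTR */
    uint8_t guest_view[PAGE_SIZE];  /* what the guest thinks is there */
    int gdt_dirty;                  /* guest changed its GDT */
};

/* Offset of [addr, addr + len) within the page; must stay inside it. */
static uint32_t page_offset(const struct shadow_page *sp,
                            uint32_t addr, uint32_t len)
{
    uint32_t off = addr - sp->linear_base;
    assert(len > 0 && off < PAGE_SIZE && len <= PAGE_SIZE - off);
    return off;
}

/* Trapped guest read: GDT bytes and adjoining data both come from the
 * private copy, never from the real page. */
static void emulate_read(const struct shadow_page *sp, uint32_t addr,
                         void *out, uint32_t len)
{
    memcpy(out, sp->guest_view + page_offset(sp, addr, len), len);
}

/* Trapped guest write: update the private copy; writes overlapping the
 * guest GDT mean the monitor's real descriptors must be re-derived. */
static void emulate_write(struct shadow_page *sp, uint32_t addr,
                          const void *in, uint32_t len)
{
    memcpy(sp->guest_view + page_offset(sp, addr, len), in, len);
    uint32_t gdt_end = sp->gdt_base + sp->gdt_limit;   /* inclusive */
    if (addr <= gdt_end && addr + len - 1 >= sp->gdt_base)
        sp->gdt_dirty = 1;
}
```

Every access to the page costs a fault plus this emulation, which is exactly the trapping overhead in question.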
> One of the cool things about running Linux as a guest
> is that we can see what the gains would be by modifying and
> recompiling things so that the GDT would be aligned in
> its own page, thus not incurring the exceptions for
> accesses to adjoining data. If it really made that
> much difference, I don't see why the diff couldn't
> make its way into future Linux versions.
The order in which variables end up in the binary image
depends on very subtle things. If you swap two files on
the link command line, the effect might be quite different.
Just imagine what would happen if all of the mutexes and
semaphores ended up in the protected page! That'd
incur a huge amount of thrashing.
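One way around the link-order lottery, if we do patch the guest, is to make the GDT both page-aligned and exactly one page long, so nothing else can ever land on its page. A sketch in C using the GCC/Clang `aligned` attribute; `gdt_page` and `gdt_set` are hypothetical names, and the real kernel's GDT definition may look quite different. The GDTR limit can still cover only the entries actually in use.

```c
#include <assert.h>
#include <stdint.h>

#define GDT_ENTRIES 512   /* 512 descriptors * 8 bytes = one 4 KB page */

/* Page-aligned and exactly one page long, so the linker cannot place
 * any other variable on this page, whatever the link order.
 * __attribute__((aligned)) is a GCC/Clang extension. */
static uint64_t gdt_page[GDT_ENTRIES] __attribute__((aligned(4096)));

_Static_assert(sizeof gdt_page == 4096, "GDT must fill exactly one page");

/* Hypothetical accessor for installing a raw 8-byte descriptor. */
static void gdt_set(unsigned index, uint64_t descriptor)
{
    assert(index < GDT_ENTRIES);
    gdt_page[index] = descriptor;
}
```

For example, `gdt_set(1, 0x00cf9a000000ffffull)` installs a flat 4 GB ring-0 code segment.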
> > Perhaps we should run a few experiments compiling the
> > module with -fPIC and see whether the kernel module loader
> > chokes or not. If that works, it'd save us
> > the effort of figuring out how to fit all those segments
> > together in the right way.
>
> I'm not very familiar with PIC code, other than I know it's
> used for dynamic libs. What I'm wondering is whether
> PIC code is relocatable *once*, as the dynamic loader loads the
> code into memory, and thereafter becomes static?
From the GCC manual:
`-fpic'
Generate position-independent code (PIC) suitable for use in a
shared library, if supported for the target machine. Such code
accesses all constant addresses through a global offset table
(GOT). The dynamic loader resolves the GOT entries when the
program starts (the dynamic loader is not part of GNU CC; it is
part of the operating system). If the GOT size for the linked
executable exceeds a machine-specific maximum size, you get an
error message from the linker indicating that `-fpic' does not
work; in that case, recompile with `-fPIC' instead. (These
maximums are 16k on the m88k, 8k on the Sparc, and 32k on the m68k
and RS/6000. The 386 has no such limit.)
Position-independent code requires special support, and therefore
works only on certain machines. For the 386, GNU CC supports PIC
for System V but not for the Sun 386i. Code generated for the IBM
RS/6000 is always position-independent.
Hmmm, come to think of it, that doesn't look like something the
Linux module loader can handle.
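As for Kevin's question, the manual excerpt suggests the answer is yes: the loader resolves the GOT entries once when the program starts, and after that the code runs unmodified, reaching globals only through the table. The toy model below illustrates that indirection in plain C. `got`, `resolve_got` and `bump_counter` are made-up names, and it ignores details such as how i386 PIC code locates the GOT itself (typically a call/pop sequence loading a register, usually %ebx) and lazy binding of function calls through the PLT.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of GOT indirection: the "PIC" routine never embeds the
 * absolute address of a global; it loads it from a table slot that the
 * "loader" fills in once. */
enum { GOT_COUNTER, GOT_LIMIT, GOT_SLOTS };

static void *got[GOT_SLOTS];   /* filled in by resolve_got() */

static int counter = 0;
static int limit = 10;

/* Stand-in for the dynamic loader: resolve every slot once at start-up. */
static void resolve_got(void)
{
    got[GOT_COUNTER] = &counter;
    got[GOT_LIMIT] = &limit;
}

/* Reaches the globals only through the GOT. */
static int bump_counter(void)
{
    int *c = got[GOT_COUNTER];
    const int *l = got[GOT_LIMIT];
    assert(c != NULL && l != NULL);   /* GOT must be resolved first */
    if (*c < *l)
        (*c)++;
    return *c;
}
```

Moving the code elsewhere afterwards would not break it, but moving the *data* would mean rewriting the GOT, which is the "keep relocating" case Kevin mentions.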
> In this case,
> we are no better off unless we want to keep relocating it
> with our own built-in ELF re-loader.
That's not so hard, but I'd rather avoid it nonetheless :)
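To give an idea of why it's not so hard: the core of such a re-loader on i386 is patching two relocation types from the psABI, R_386_32 (value S + A) and R_386_PC32 (value S + A - P). i386 uses REL entries, so the addend A is the 32-bit value already stored at the patch site. A sketch assuming the section is already loaded, the symbol value is already resolved, and the host is little-endian like x86; `apply_rel` is a hypothetical helper, and the constants are defined locally with the same values as in `<elf.h>`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* i386 relocation types; same values as R_386_32 / R_386_PC32 in <elf.h>. */
#define RELOC_386_32   1   /* S + A     */
#define RELOC_386_PC32 2   /* S + A - P */

/* Apply one REL relocation. `image` is the loaded section, `load_addr`
 * its run-time address, `offset` the patch site within it and `sym` the
 * resolved symbol value S. Returns -1 for unsupported types. */
static int apply_rel(uint8_t *image, uint32_t load_addr, uint32_t offset,
                     unsigned type, uint32_t sym)
{
    uint32_t addend, value;
    uint32_t place = load_addr + offset;              /* P */

    assert(image != NULL);
    memcpy(&addend, image + offset, sizeof addend);   /* implicit A */
    switch (type) {
    case RELOC_386_32:   value = sym + addend;         break;
    case RELOC_386_PC32: value = sym + addend - place; break;
    default:             return -1;
    }
    memcpy(image + offset, &value, sizeof value);
    return 0;
}
```

Relocating the module again later just means re-running this over the relocation entries with the new load address, which is the part I'd rather not have to keep doing.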
Ramon