OK, I think we have developed a pretty good plan for
virtualizing many x86 features so far.
The following text should fill in some more voids.
-Kevin
TIME REFERENCE AND TIMER FACILITIES IN FREEMWARE
================================================
We should probably talk about timing in our
virtualization environment.
There are two main software components in our virtualization
strategy. We have (1) a user program component which communicates
with (2) a kernel module component through the normal kernel
interfaces like ioctl(), read(), write(), etc.
All or most of the device emulation (video board, hard drive,
keyboard, etc) will be done at the user program level, and
we'll make use of the standard C library interfaces and such
to implement these.
The reason I say most is that, for performance reasons, parts of
some devices, such as the timer and interrupt controller chips,
can likely be moved into the monitor domain. As discussed before,
this would alleviate a lot of context switching between the host
and guest contexts. We don't have to do this kind of thing right
away. It's worth pointing out, though, that parts of
quite a few devices can be moved into the monitor. For
example the floppy controller could be done in the monitor,
the floppy drive in the user program. The VGA adapter
in the monitor, the CRT display in the user app. Etc. etc.
Anyway, we need some kind of accurate time reference, and timer
services, from this virtualization framework. For example, to
emulate the CMOS RTC, a device model needs to be notified once
per second so it can update the clock.
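As a rough illustration of the kind of bookkeeping the framework needs (all the names below are made up for this sketch, not existing FreeMWare code), device models could register timer requests in a table which gets advanced by elapsed guest time:

```c
/* Hypothetical sketch of the timer framework's request table:
 * device models register a callback and a period, and whatever
 * has expired for the elapsed guest time gets called. */
#include <stddef.h>

#define MAX_TIMERS 16

typedef void (*timer_cb)(void *opaque);

struct timer_req {
    unsigned long period_usec;    /* interval between notifications */
    unsigned long remaining_usec; /* time left until next firing    */
    timer_cb      cb;             /* e.g. a CMOS RTC tick handler   */
    void         *opaque;
    int           active;
};

static struct timer_req timers[MAX_TIMERS];

int timer_register(unsigned long period_usec, timer_cb cb, void *opaque)
{
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (!timers[i].active) {
            timers[i] = (struct timer_req){ period_usec, period_usec,
                                            cb, opaque, 1 };
            return i;
        }
    }
    return -1; /* no free slot */
}

/* Called with the amount of guest time that just elapsed. */
void timer_advance(unsigned long elapsed_usec)
{
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (!timers[i].active)
            continue;
        if (timers[i].remaining_usec <= elapsed_usec) {
            timers[i].cb(timers[i].opaque);
            timers[i].remaining_usec = timers[i].period_usec;
        } else {
            timers[i].remaining_usec -= elapsed_usec;
        }
    }
}

/* Demo callback: counts "seconds" for a CMOS RTC model. */
static int rtc_ticks;
static void rtc_tick(void *opaque) { (void)opaque; rtc_ticks++; }
```

A CMOS RTC model would register rtc_tick with a one-second period, and the framework would fire it each time a cumulative second of guest execution has gone by.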
Our timer facilities have to relate very closely to the amount of
real execution time the guest code gets. What we don't want
to use are time references based on the host OS clock, as
those are highly dependent on system load and other factors.
Depending upon the guest code running, there may also be a
considerable amount of time spent in the monitor as part
of the virtualization implementation, for certain local
chunks of guest OS code. We should exclude this time if
at all possible, since it is not time when the guest OS
is really running, and it would skew the time reference.
So our approach could go something like this. Each time,
just before the monitor hands over execution to the guest
code, we take a snapshot of the time using the RDTSC instruction.
Linux even defines an asm macro for this. :^)
Upon the next invocation of our monitor code (via
an interrupt or exception) we take a second sample using
the same instruction. Now we have an accurate measure
of how long the guest code actually ran without intervention.
We pass this duration to the timer framework. If there are
requests from the device models to be notified given the elapsed
time, then we call them. If they live in the user app world,
then we return back to the user app, which sees this as a return
from the ioctl() call with some fields filled in, like how
long we ran for, etc. If we were wicked perfectionists, we could
subtract from our RDTSC values the number of cycles it takes
to get the guest code started again, and for the exception to
occur.
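For reference, the sampling could look something like this in GCC inline assembly (the rdtsc() helper here is a sketch, essentially what the Linux macro expands to):

```c
/* Minimal sketch of the RDTSC sampling described above,
 * using GCC inline assembly (x86 only). */
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    /* RDTSC returns the 64-bit time stamp counter in EDX:EAX. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Around the guest-execution window:
 *   t0 = rdtsc();        just before handing control to guest code
 *   ... guest runs ...
 *   t1 = rdtsc();        first thing on re-entry to the monitor
 *   elapsed = t1 - t0;   cycles the guest actually ran
 */
```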
Of course, the guest code we run at any one time could conceivably
not invoke the virtualization monitor before our next device
model asks to be notified. The next bounding event would then
be caused by a hardware interrupt redirect. Each host OS
sets the IRQ0 timer to interrupt at some particular rate;
let's say it's 100Hz, or every 0.01 seconds. Now suppose a
device model wants to be woken at 0.005 seconds, and that
the guest code is well behaved and doesn't invoke
the monitor during the next user process time quantum.
So if we want highly accurate timing, we need a mechanism for
interrupting ourselves in the middle of a quantum. Fortunately,
the built-in APIC on the Pentium has a timer driven by the CPU
clock which can do this. It can be programmed in either periodic
or one-shot mode. (Thanks to one of the developers for suggesting
the use of this timer facility.)
If we saw this condition, we could set the APIC timer
to go off at the equivalent of 0.005 seconds, and our monitor
would be notified right on the money.
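To sketch what arming the one-shot mode involves (the register offsets are from Intel's manuals; the function and its apic parameter are purely illustrative, since real code writes to the memory-mapped APIC at its physical base, normally 0xFEE00000):

```c
/* Hedged sketch of arming the local APIC timer in one-shot mode.
 * Passing the APIC base in as a pointer lets the routine be
 * exercised against a plain array instead of real hardware. */
#include <stdint.h>

#define APIC_LVT_TIMER   (0x320 / 4)  /* LVT timer register       */
#define APIC_TIMER_INIT  (0x380 / 4)  /* initial-count register   */
#define APIC_TIMER_DIV   (0x3E0 / 4)  /* divide-configuration reg */

#define APIC_TIMER_ONESHOT 0          /* bit 17 clear = one-shot  */
#define APIC_DIV_BY_1      0x0B      /* count at full CPU clock   */

void apic_arm_oneshot(volatile uint32_t *apic,
                      uint8_t vector, uint32_t cycles)
{
    apic[APIC_TIMER_DIV]  = APIC_DIV_BY_1;
    apic[APIC_LVT_TIMER]  = APIC_TIMER_ONESHOT | vector;
    apic[APIC_TIMER_INIT] = cycles;   /* countdown starts here */
}
```

With a divide-by-1 configuration the count is in CPU clocks, so 0.005 seconds on a 200MHz part would be an initial count of 1,000,000.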
Other tricks, like temporarily reprogramming the PIT or the CMOS
timer for a finer-grained interrupt during that one quantum, could
be used as a back-up plan for CPUs without the APIC timer capability.
For these CPUs, rather than getting a time reference by
reading the time stamp counter with RDTSC, we could read the
PIT counter register. This is not as high-resolution, but perhaps
functional enough.
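A sketch of the PIT counter-latch sequence, with the port I/O abstracted behind function pointers so it can be shown without ring-0 privileges (the stub functions just simulate a latched count; real code would use inb/outb on ports 0x40/0x43, with channel 0 in low-byte/high-byte access mode):

```c
/* Sketch of latching and reading PIT counter 0 for CPUs where
 * RDTSC/the APIC timer aren't available. */
#include <stdint.h>

#define PIT_CMD  0x43   /* PIT mode/command port */
#define PIT_CH0  0x40   /* PIT channel 0 data port */

typedef void    (*outb_fn)(uint16_t port, uint8_t val);
typedef uint8_t (*inb_fn)(uint16_t port);

uint16_t pit_read_counter0(outb_fn out, inb_fn in)
{
    uint8_t lo, hi;
    out(PIT_CMD, 0x00);   /* counter-latch command, channel 0 */
    lo = in(PIT_CH0);     /* low byte of latched count  */
    hi = in(PIT_CH0);     /* high byte of latched count */
    return (uint16_t)((hi << 8) | lo);
}

/* Stubs for illustration: simulate a PIT whose latched
 * counter 0 value is 0x1234. */
static uint8_t pit_fake[2] = { 0x34, 0x12 };
static int pit_fake_idx;
static void fake_outb(uint16_t port, uint8_t val)
{ (void)port; (void)val; pit_fake_idx = 0; }
static uint8_t fake_inb(uint16_t port)
{ (void)port; return pit_fake[pit_fake_idx++]; }
```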
I suppose, for starters, we could declare the resolution of our
timer facilities to be, at best, the period of the host OS's
timer interrupt. :^)
To tie this together with the FreeMWare code we have already,
let's look at how this plays out for another contrived example.
Again, the host OS uses a 0.01 second periodic interrupt. And let's
say the next interrupt required is at 0.035 seconds. The user app
code component would probably look something like this:
...
s.run_for = 0.035; // seconds
ioctl(fd, RUN_FOR_N_SECONDS, &s);
if (s.timer_requests_satisfied) {
    // call all user-level device emulation callbacks whose
    // timers have fired
}
...
And on the kernel module side:
...
again:
    host2monitor();
    switch (monitor_info.ret_because) {
    case IRQ:
        soft_int(IRQ_vector);
        if (need_sched)
            schedule();
        goto again; // no need to return to user app
    case TIMER:
        s.timer_requests_satisfied = 1;
        // fill in other timer info for the user app
        return(0); // return to user app: timer request complete
    }
Hope this helps to visualize things.
So far all time reference has been relative to the execution of
guest code. This is the accurate way to make things respond to
the guest code properly. There are, however, things which are
better tied to the host OS time reference.
Let's pick on the VGA emulation. There really are two parts to
it. The first is the hardware adapter emulation. It needs to
live in the time reference of the guest code for accurate emulation.
It does not care whether a CRT is attached to it, or in other
words, whether you actually view the output. The emulation
of spewing the frame buffer output to your CRT can be done in
any time reference. This will be implemented using a GUI library
(on a lot of platforms, X11), at the user application level. You
might want to refresh the output every so often, if it has been
updated, but not too often, or you'll bog the system down.
In this scenario, we are better off using timing facilities of
the host OS. It's probably better to move this function off
into a separate thread/process. There are other device models
which are candidates for this sort of separation. We can look
into this more as we go.
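To make the separate-thread idea concrete, here's a minimal sketch of a host-time-driven refresh loop (the dirty flag, the repaint hook, and the 20Hz rate are all illustrative assumptions, not decided design):

```c
/* Sketch of a display refresh thread running on host time:
 * wake at a fixed host interval and repaint only if the VGA
 * adapter model dirtied the frame buffer since last time. */
#include <pthread.h>
#include <unistd.h>

static volatile int fb_dirty;        /* set by the VGA adapter model */
static volatile int keep_running = 1;
static volatile int repaint_count;

static void repaint(void)
{
    repaint_count++;   /* real code would blit to the X11 window */
}

static void *refresh_thread(void *arg)
{
    (void)arg;
    while (keep_running) {
        if (fb_dirty) {
            fb_dirty = 0;
            repaint();
        }
        usleep(50 * 1000);  /* ~20Hz: often enough, not a hog */
    }
    return 0;
}
```

The point is that this loop never touches the guest time reference at all; it only decides how often a human gets to see the frame buffer.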
-Kevin