Avi Kivity wrote:
> I am currently investigating a problem with the a guest running Linux
> malfunctioning in the NMI watchdog code. The problem is that we don't
> handle NMI delivery mode for the local APIC LINT0 pin; instead we
> expect ExtInt deliver mode or that the line is disabled completely.
> In addition the i8254 timer is tied to the BSP, while in this case the
> timer can broadcast to all vcpus.
>
> There is some code that tries to second-guess the guest and provide it
> the inputs it sees, but this is fragile. The only way to get reliable
> operation is to emulate the hardware fully.
>
> Now I'd much rather do that in userspace, since it's a lot of
> sensitive work. I'll enumerate below the general motivation,
> advantages and disadvantages, and a plan for moving forward.
>
> Motivation
> ==========
>
> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices. Both devices are
> high interaction since they deal with interrupts; practically after
> every interrupt there is either a PIC ioport write, or an APIC bus
> message, both signalling an EOI operation. Moving the PIT into the
> kernel allowed us to catch up with missed timer interrupt injections,
> and speeded up guests which read the PIT counters (e.g. tickless
> guests).
>
> However, modern guests running on modern qemu use MSI extensively;
> both virtio and assigned devices now have MSI support; and the
> planned VFIO only supports kernel delivery via MSI anyway; line based
> interrupts will need to be mediated by userspace.
>
> The only high frequency non-MSI interrupt sources remaining are the
> various timers; and the default one, HPET, is in userspace (and having
> its own scaling problems as a result). So in theory we can move PIC,
> IOAPIC, and PIT support to userspace and not lose much performance.
>
> Moving the implementation to userspace allows us more flexibility, and
> more consistency in the implementation of timekeeping for the various
> clock chips; it becomes easier to follow the nuances of real hardware
> in this area.
>
> Interestingly, while the IOAPIC/PIC code was written we proposed
> making it independent of the local APIC; had we done so, the move
> would have been much easier (simply dropping the existing code).
>
>
> Advantages of a move
> ====================
>
> 1. Reduced kernel footprint
>
> Good for security, and allows fixing bugs without reboots.
>
> 2. Centralized timekeeping
>
> Instead of having one solution for PIT timekeeping, and another for
> RTC and HPET timekeeping, we can have all timer chips in userspace.
> The local APIC timer still needs to be in the kernel - it is much too
> high bandwidth to be in userspace; but on the other hand it is very
> different from the other timer chips.
>
> 3. Flexibility
>
> Easier to have wierd board layouts (multiple IOAPICs, etc.). Not a
> very strong advantage.
>
> Disadvantages
> =============
>
> 1. Still need to keep the old code around for a long while
>
> We can't just rip it out - old userspace depends on it. So the
> security advantages are only with cooperating userspace, and the
> other advantages only show up.
>
> 2. Need to bring the qemu code up to date
>
> The current qemu ioapic code lags some way behind the kernel; also
> need PIT timekeeping
>
> 3. May need kernel support for interval-timer-follows-thread
>
> Currently the timekeeping code has an optimization which causes the
> hrtimer that models the PIT to follow the BSP (which is most likely to
> receive the interrupt); this reduces cpu cross-talk.
>
> I don't think the kernel interval timer code has such an optimization;
> we may need to implement it.
>
> 4. Much churn
>
> This is a lot of work.
>
> 5. Risk
>
> We may find out after all this is implemented that performance is not
> acceptable and all the work will have to be dropped.
>
Besides the performance overhead risk introduced by VFIO interrupts and timer interrupts, EOI message delivery from the lapic to the ioapic, which would now live in userland, may have a scalability issue. For example, in a 64-VCPU guest where each vcpu takes interrupts (or IPIs) at 1 kHz, the guest's EOIs will normally have to involve the ioapic module for clearing at an aggregate rate of 64 kHz, which may cause long lock contention. You could reduce the ioapic's involvement in EOI handling by tracking the ioapic pin <-> vector map in the kernel, but I am not sure that is clean enough.
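
To make the pin <-> vector tracking idea a bit more concrete, here is a minimal sketch in C. It is not taken from any existing kvm or qemu code; `struct pin_route`, `struct eoi_filter`, and all function names are hypothetical. It assumes userspace hands the kernel a copy of the ioapic routing entries whenever the guest reprograms them, and that only level-triggered pins need to see an EOI (to clear Remote IRR), so the kernel keeps one bit per vector and forwards an EOI to userspace only when that bit is set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_VECTORS 256

/* One ioapic routing entry as userspace might describe it (hypothetical layout). */
struct pin_route {
    uint8_t vector;          /* vector programmed into the redirection entry */
    int     level_triggered; /* nonzero if the pin is level-triggered */
};

/* Hypothetical kernel-side filter: bit v set => an EOI for vector v must reach the ioapic. */
struct eoi_filter {
    uint64_t bits[NR_VECTORS / 64];
};

/* Rebuild the filter from scratch each time the routing table changes. */
void eoi_filter_rebuild(struct eoi_filter *f,
                        const struct pin_route *routes, size_t n)
{
    assert(f != NULL);
    memset(f, 0, sizeof(*f));
    for (size_t i = 0; i < n; i++) {
        if (!routes[i].level_triggered)
            continue; /* edge-triggered pins have no Remote IRR to clear */
        f->bits[routes[i].vector / 64] |= UINT64_C(1) << (routes[i].vector % 64);
    }
}

/* Fast-path check on EOI: 1 if the userspace ioapic must be told, else 0. */
int eoi_needs_ioapic(const struct eoi_filter *f, uint8_t vector)
{
    return (int)((f->bits[vector / 64] >> (vector % 64)) & 1);
}

/* Count how many EOIs in a trace of vectors would be forwarded to userspace. */
size_t eoi_count_forwarded(const struct eoi_filter *f,
                           const uint8_t *vectors, size_t n)
{
    size_t forwarded = 0;
    for (size_t i = 0; i < n; i++)
        forwarded += (size_t)eoi_needs_ioapic(f, vectors[i]);
    return forwarded;
}
```

With something like this, the 64 vcpus x 1 kHz = 64,000 EOIs per second in the example above would only reach the userspace ioapic for vectors that belong to level-triggered pins; EOIs for IPIs and timer vectors would be filtered in the kernel without taking the ioapic lock. Whether keeping such a map coherent with userspace is clean enough is still the open question.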