On 04/12/2018 07:15 AM, David Gibson wrote:
> On Wed, Jan 17, 2018 at 03:39:46PM +0100, Cédric Le Goater wrote:
>> On 01/17/2018 12:10 PM, Benjamin Herrenschmidt wrote:
>>> On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote:
>>>>>>> Also, have we decided how the process of switching between XICS and
>>>>>>> XIVE will work vs. CAS ?
>>>>>>
>>>>>> That's how it is described in the architecture. The current choice is
>>>>>> to create both XICS and XIVE objects and choose at CAS which one to
>>>>>> use. It relies today on the capability of the pseries machine to
>>>>>> allocate IRQ numbers for both interrupt controller backends. These
>>>>>> patches have been merged in QEMU.
>>>>>>
>>>>>> A change of interrupt mode results in a reset. The device tree is
>>>>>> populated accordingly and the ICPs are switched for the model in
>>>>>> use.
>>>>>
>>>>> For KVM we need to only instantiate one of them though.
>>>>
>>>> Hmm,
>>>>
>>>> How would we handle a guest rebooting on a kernel without XIVE support ?
>>>
>>> It will do CAS again and we can change the devices.
>>
>> So, we would destroy the previous QEMU ICS object and create a new one
>> in the CAS hcall. That would probably work. There might be some issues
>> in creating and destroying the ICS KVM device, but that can be studied
>> without XIVE.
>
> Adding and removing devices at runtime based on guest requests like
> this will get really hairy in qemu.

I confirm ...

> As I've said before for the first cut, I think we want to select just
> one as a machine option to avoid this confusion.

OK

> Looking further ahead, I think we'll be better off having both the
> XIVE and XICS models always present (at least minimally) in qemu, but
> with only one "active" at any given time.

Under emulation, it is not too complex to support both modes. XIVE and
XICS objects are both created, but spapr->ov5_cas filters their usage.
However, syncing the change in KVM is more complex.

> Note that having the inactive one destroy and clean up the
> corresponding KVM devices is fine, as is deallocating as much of its
> runtime state as we can without changing the notional QOM tree.

Yes.

I will try to send a patchset organized this way:

 - spapr XIVE emulated mode (both modes supported)
 - XIVE KVM in an exclusive way; the machine will need to be restarted
   from the command line to change interrupt mode
 - support of change of interrupt mode under KVM
 - powernv device model (rough)

C.

>> It used to be considered ugly to create a QEMU device at reset time, so
>> I wonder if this is still the case, because when the machine reaches CAS,
>> we really are beyond reset.
>>
>> If this is OK, then the next "issue" is to keep in sync the allocated
>> IRQ numbers. The IRQ allocator is now merged at the machine level, so
>> the synchronization is obvious to do when both backend QEMU objects
>> are available. That's the path I took. If both QEMU objects are not
>> available, then we need to scan the IRQ number space in the current
>> interrupt mode and allocate the same IRQs in the newly negotiated mode.
>> Probably OK. I don't see major problems with the current code.
>>
>> Migration is a problem. We will need both backend QEMU objects to be
>> available anyhow if we want to migrate. So we are back to the current
>> solution creating both QEMU objects but we can try to defer some of the
>> KVM inits and create the KVM device on demand at CAS time.
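
To make the "scan the IRQ number space and allocate the same IRQs in the
newly negotiated mode" idea concrete, here is a minimal standalone sketch.
It is not QEMU code: NR_IRQS, IrqBackend, irq_claim() and irq_mirror() are
hypothetical names made up for illustration, and a real backend would
track far more state than one flag per IRQ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_IRQS 16 /* hypothetical size of the machine-level IRQ number space */

/* Hypothetical per-backend allocation map: true means the number is claimed. */
typedef struct IrqBackend {
    bool claimed[NR_IRQS];
} IrqBackend;

/* Claim one IRQ number; fails if it is out of range or already claimed. */
static bool irq_claim(IrqBackend *b, size_t irq)
{
    assert(b != NULL);
    if (irq >= NR_IRQS || b->claimed[irq]) {
        return false;
    }
    b->claimed[irq] = true;
    return true;
}

/*
 * Scan the IRQ number space of the current backend and claim the same
 * numbers in the newly negotiated one. Returns how many IRQs were newly
 * claimed in 'next'; numbers already claimed there are left untouched.
 */
static size_t irq_mirror(const IrqBackend *cur, IrqBackend *next)
{
    size_t mirrored = 0;

    assert(cur != NULL && next != NULL);
    for (size_t irq = 0; irq < NR_IRQS; irq++) {
        if (cur->claimed[irq] && irq_claim(next, irq)) {
            mirrored++;
        }
    }
    return mirrored;
}
```

Calling irq_mirror() a second time with the same arguments claims nothing
new, so the scan could in principle be repeated at each CAS without
double-allocating.
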
>>
>> The next problem is the ICP object that currently needs the KVM device
>> fd to connect the vcpus ... So, we will need to change that also.
>> That is probably the biggest problem today. We need a way to disconnect
>> the vcpu from the KVM device and see how we can defer the connection.
>> I need to make sure this is possible, I can check that without XIVE
>> I think.
>>
>>>> Are you suggesting to create the XICS or XIVE device in the CAS
>>>> negotiation process ? So, the machine would not have any interrupt
>>>> controller before CAS. That seems really late to me. grub uses the
>>>> console for instance.
>>>
>>> We start with XICS by default.
>>
>> yes.
>>
>>>> I think it should prepare for both options, start in XIVE legacy mode,
>>>> which is XICS, then possibly switch to XIVE exploitation mode.
>>>>
>>>>>>> And how that will interact with KVM ?
>>>>>>
>>>>>> I expect we will do the same, which is to create two KVM devices to
>>>>>> be able to handle both interrupt controller backends depending on the
>>>>>> mode negotiated by the guest.
>>>>>
>>>>> That will be an ungodly mess, I'd rather we only instantiate the right
>>>>> one.
>>>>
>>>> It's rather transparent currently in the emulated version. There are two
>>>> sets of objects in QEMU, switching is done in CAS. KVM support should not
>>>> change anything in that area.
>>>>
>>>> I expect the 'xive-kvm' object to get/set states for migration, just like
>>>> for XICS and to setup the ESB+TIMA memory regions, which is new.
>>>
>>> But both XICS and XIVE are completely different kernel KVM devices that
>>> will need to "hook" into the same set of internal hooks for things like
>>> interrupts being passed through, RTAS calls etc...
>>>
>>> How does KVM know which one to "activate" ?
>>
>> Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ?
>> I haven't studied all the low level details though.
>>
>>> I don't think the kernel should have both.
>>
>> I hear that.
>> From a QEMU perspective, it is much easier to put everything
>> in place for both interrupt modes and let the guest decide what it wants
>> to use.
>>
>> If we choose not to, we will need to find a solution to defer the KVM
>> inits and to disconnect/reconnect the vcpus. For the latter, we could
>> add a KVM_DISABLE_CAP ioctl or maybe better add a new capability like
>> KVM_CAP_IRQ_XIVE to perform the switch.
>>
>>
>> C.
>>
>