Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 25.02.2013, at 01:59, Paul Mackerras wrote: On Wed, Feb 20, 2013 at 04:58:54PM -0300, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. So, I see Scott already answered from the point of view of his MPIC emulation stuff, but I'll answer too from the point of view of my XICS emulation code. My understanding, possibly imperfect, is that in a real system the routing of GSIs to IOAPICs would either be hardwired or set up by the BIOS, described in ACPI tables, and not modified by the operating system. Is that correct? So my belief is that the GSI routing is fundamentally distinct from and handled differently from the routing of interrupts to CPUs, which is fully under the control of the OS. It's a different layer. I guess there's really some confusion on names here :). I'm always confused when I read sources and you apparently get confused when you read about GSIs. GSIs are an ACPI concept. It's not x86 specific, it's also not APIC specific. It's just a global name space for IRQs. Imagine you have 2 MPICs in your system. But you only want to use a single token / numer to access any IRQ on any chip. That's where GSIs come into play. They map different irqchip IRQs onto a flat number space. To speak with x86 names: Virtualization perspective: QEMU - GSI - IOAPIC - LAPIC - CPU Device perspective: Device irq line - IOAPIC - LAPIC - CPU The IOAPIC is the piece of hardware that interrupt lines get attached to. You connect a pin on it to an irq pin of your device. That talks to the LAPIC to actually schedule interrupts on target CPUs. The LAPIC then fetches interrupts and pulls the CPU's interrupt line. Of course, things are slightly more complicated in the x86 world, as everything behind the IOAPIC also carries a payload defining which pin actually got triggered, but you get the idea. So really just consider GSIs as a global flat number space for irqchip pins. Alex In the XICS model we have a set of interrupt sources, each identified by a 24-bit number. Control operations on an interrupt source just identify the source by its number. Thus the interrupt source number is like a GSI, but we don't need to map that to a different space (e.g. IOAPIC identifier and input number) in order to operate on it, we can just operate on it directly. Paul. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 25.02.2013, at 01:59, Paul Mackerras wrote: On Wed, Feb 20, 2013 at 04:58:54PM -0300, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. So, I see Scott already answered from the point of view of his MPIC emulation stuff, but I'll answer too from the point of view of my XICS emulation code. My understanding, possibly imperfect, is that in a real system the routing of GSIs to IOAPICs would either be hardwired or set up by the BIOS, described in ACPI tables, and not modified by the operating system. Is that correct? So my belief is that the GSI routing is fundamentally distinct from and handled differently from the routing of interrupts to CPUs, which is fully under the control of the OS. It's a different layer. I guess there's really some confusion on names here :). I'm always confused when I read sources and you apparently get confused when you read about GSIs. GSIs are an ACPI concept. It's not x86 specific, it's also not APIC specific. It's just a global name space for IRQs. Imagine you have 2 MPICs in your system. But you only want to use a single token / numer to access any IRQ on any chip. That's where GSIs come into play. They map different irqchip IRQs onto a flat number space. To speak with x86 names: Virtualization perspective: QEMU - GSI - IOAPIC - LAPIC - CPU Device perspective: Device irq line - IOAPIC - LAPIC - CPU The IOAPIC is the piece of hardware that interrupt lines get attached to. You connect a pin on it to an irq pin of your device. That talks to the LAPIC to actually schedule interrupts on target CPUs. The LAPIC then fetches interrupts and pulls the CPU's interrupt line. Of course, things are slightly more complicated in the x86 world, as everything behind the IOAPIC also carries a payload defining which pin actually got triggered, but you get the idea. So really just consider GSIs as a global flat number space for irqchip pins. Alex In the XICS model we have a set of interrupt sources, each identified by a 24-bit number. Control operations on an interrupt source just identify the source by its number. Thus the interrupt source number is like a GSI, but we don't need to map that to a different space (e.g. IOAPIC identifier and input number) in order to operate on it, we can just operate on it directly. Paul. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote: On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? You can consider GSI to be a cookie that you use to refer to whatever data you've put into routing table by KVM_IRQ_LINE/irqfd interface. Even on x86, when irq routing is used to inject MSI interrupt, this is exactly how GSI is used. In MSI case it does not have a meaning besides look at that interrupt entry to see what MSI should be injected. Are there constraints on what sort of GSI numbers I can choose (I now see in the code that KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is that documented? The only constrain is that the number should be smalled than KVM_MAX_IRQ_ROUTES, but this is implementation detail. Current implementation uses array to map from GSI to a data, if a lot more entries then currently allowed is needed implementation may be changed to different data structure. It looks like the APIC implementation has default routes, where is that documented?)? It is very PC centric, should not be even compiled for other arches. Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? It is a mistake to refer to the irq routing table as APICy :). It is certainly PC centric currently, but there is at least one HW layer between it and the APIC. PC has global GSI space, each GSI can be delivered via different irq chip. Some GSIs can be delivered through multiple irq chips. Irq routing table provides mapping between GSI and irq chips it should be delivered through. Some irq chips deliver interrupt via APIC some not, but this is different story. The work is needed to make the code not PC centric, but it should not be a lot of work. I suppose I could write another implementation of the table management code for MPIC, though the placement of irqchip inside the route entry, rather than as an argument to KVM_IRQ_LINE, suggests the table is supposed to be global, not in the individual interrupt controller. Yes, it is global. It sits between emulated devices and irq chips. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Wed, Feb 20, 2013 at 04:58:54PM -0300, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. So, I see Scott already answered from the point of view of his MPIC emulation stuff, but I'll answer too from the point of view of my XICS emulation code. My understanding, possibly imperfect, is that in a real system the routing of GSIs to IOAPICs would either be hardwired or set up by the BIOS, described in ACPI tables, and not modified by the operating system. Is that correct? So my belief is that the GSI routing is fundamentally distinct from and handled differently from the routing of interrupts to CPUs, which is fully under the control of the OS. In the XICS model we have a set of interrupt sources, each identified by a 24-bit number. Control operations on an interrupt source just identify the source by its number. Thus the interrupt source number is like a GSI, but we don't need to map that to a different space (e.g. IOAPIC identifier and input number) in order to operate on it, we can just operate on it directly. Paul. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote: On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? You can consider GSI to be a cookie that you use to refer to whatever data you've put into routing table by KVM_IRQ_LINE/irqfd interface. Even on x86, when irq routing is used to inject MSI interrupt, this is exactly how GSI is used. In MSI case it does not have a meaning besides look at that interrupt entry to see what MSI should be injected. Are there constraints on what sort of GSI numbers I can choose (I now see in the code that KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is that documented? The only constrain is that the number should be smalled than KVM_MAX_IRQ_ROUTES, but this is implementation detail. Current implementation uses array to map from GSI to a data, if a lot more entries then currently allowed is needed implementation may be changed to different data structure. It looks like the APIC implementation has default routes, where is that documented?)? It is very PC centric, should not be even compiled for other arches. Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? It is a mistake to refer to the irq routing table as APICy :). It is certainly PC centric currently, but there is at least one HW layer between it and the APIC. PC has global GSI space, each GSI can be delivered via different irq chip. Some GSIs can be delivered through multiple irq chips. Irq routing table provides mapping between GSI and irq chips it should be delivered through. Some irq chips deliver interrupt via APIC some not, but this is different story. The work is needed to make the code not PC centric, but it should not be a lot of work. I suppose I could write another implementation of the table management code for MPIC, though the placement of irqchip inside the route entry, rather than as an argument to KVM_IRQ_LINE, suggests the table is supposed to be global, not in the individual interrupt controller. Yes, it is global. It sits between emulated devices and irq chips. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Sat, Feb 16, 2013 at 01:56:14PM +1100, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote: On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. Further, the magic means that you can only have one instance of the device, whereas you might want to model the interrupt controller architecture as several devices. You could do that using several device types, but then the interconnections between them would also be magic. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. snip -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? Are there constraints on what sort of GSI numbers I can choose (I now see in the code that KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is that documented? It looks like the APIC implementation has default routes, where is that documented?)? Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? I suppose I could write another implementation of the table management code for MPIC, though the placement of irqchip inside the route entry, rather than as an argument to KVM_IRQ_LINE, suggests the table is supposed to be global, not in the individual interrupt controller. It looks like I'm going to have to do this anyway for irqfd, though that doesn't make the other uses of the device control api go away. Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the status for debugging (reading device registers doesn't quite do it, since the active bit won't show up if the interrupt is masked). At that point, is it more offensive to make it read-only even though it would be trivial to make it read/write (which would allow users who don't need it to bypass the routing API), or to make it read/write and live with there being more than one way to do something? KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of state, and because it doesn't allow debugging access to device registers (e.g. inspecting from the QEMU command line), and because it's hard to add new pieces of state if we realize we left something out. It reminds be of GET/SET_SREGS. With that, I did what you seem to want here, which is to adapt the existing interfaces, using feature flags to control optional state. It ended up being a mess, and ONE_REG was introduced as a replacement. The device control API is the equivalent of ONE_REG for things other than vcpus. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote: On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? I have no idea. Is mapping from one integer linear space (GSIs) to real interrupts suitable for you? Are there constraints on what sort of GSI numbers I can choose (I now see in the code that KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is that documented? Don't think it is. It looks like the APIC implementation has default routes, where is that documented?)? In the code. Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? Thinking faster than typing? Not sure what you mean. I suppose I could write another implementation of the table management code for MPIC, though the placement of irqchip inside the route entry, rather than as an argument to KVM_IRQ_LINE, suggests the table is supposed to be global, not in the individual interrupt controller. Yes the table is global. It maps GSI (Global System Interrupt IIRC) (integer) to (irqchip,pin) pair. It looks like I'm going to have to do this anyway for irqfd, though that doesn't make the other uses of the device control api go away. Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the status for debugging (reading device registers doesn't quite do it, since the active bit won't show up if the interrupt is masked). At that point, is it more offensive to make it read-only even though it would be trivial to make it read/write (which would allow users who don't need it to bypass the routing API), or to make it read/write and live with there being more than one way to do something? Can't follow this sentence. KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of state, and because it doesn't allow debugging access to device registers (e.g. inspecting from the QEMU command line), and because it's hard to add new pieces of state if we realize we left something out. It reminds be of GET/SET_SREGS. With that, I did what you seem to want here, which is to adapt the existing interfaces, using feature flags to control optional state. It ended up being a mess, and ONE_REG was introduced as a replacement. The device control API is the equivalent of ONE_REG for things other than vcpus. -Scott - ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2? - Agree on poor extensibility of interface. Adding a reserved amount of space as padding and versioning such as has been done so far is not acceptable? - Debugging: why is reading entire register state not acceptable? Yes, its slow. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/20/2013 07:09:55 PM, Marcelo Tosatti wrote: On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote: On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? I have no idea. Is mapping from one integer linear space (GSIs) to real interrupts suitable for you? I can live with it. Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? Thinking faster than typing? Not sure what you mean. Sorry... The code to manage the table lives in virt/kvm/irq_comm.c. That code is very APIC-specific and not currently in a state that invites sharing. It looks like I'm going to have to do this anyway for irqfd, though that doesn't make the other uses of the device control api go away. Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the status for debugging (reading device registers doesn't quite do it, since the active bit won't show up if the interrupt is masked). At that point, is it more offensive to make it read-only even though it would be trivial to make it read/write (which would allow users who don't need it to bypass the routing API), or to make it read/write and live with there being more than one way to do something? Can't follow this sentence. Suppose, for the sake of irqfd (and maillist tranquility) MPIC uses existing KVM_IRQ_LINE and such. Will there be objection to being able to use KVM_GET_DEVICE_ATTR to *get* the irq line status for debugging purposes (maybe also useful for migration)? If there's no objection to that, would there be any actual reason, beyond saving a few lines of glue code, to make it a read-only attribute? KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of state, and because it doesn't allow debugging access to device registers (e.g. inspecting from the QEMU command line), and because it's hard to add new pieces of state if we realize we left something out. It reminds be of GET/SET_SREGS. With that, I did what you seem to want here, which is to adapt the existing interfaces, using feature flags to control optional state. It ended up being a mess, and ONE_REG was introduced as a replacement. The device control API is the equivalent of ONE_REG for things other than vcpus. -Scott - ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2? Well, that's what KVM_SET_DEVICE_ATTR is. - Agree on poor extensibility of interface. Adding a reserved amount of space as padding and versioning such as has been done so far is not acceptable? That's exactly what we did with SREGS, and it got declared a mess and replaced with ONE_REG. I'm trying to learn from my mistakes. :-) - Debugging: why is reading entire register state not acceptable? Yes, its slow. For one, it's more work. The current way works by simulating a guest MMIO access. No blob format to design, implement, and keep compatible. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? Are there constraints on what sort of GSI numbers I can choose (I now see in the code that KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is that documented? It looks like the APIC implementation has default routes, where is that documented?)? Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? I suppose I could write another implementation of the table management code for MPIC, though the placement of irqchip inside the route entry, rather than as an argument to KVM_IRQ_LINE, suggests the table is supposed to be global, not in the individual interrupt controller. It looks like I'm going to have to do this anyway for irqfd, though that doesn't make the other uses of the device control api go away. Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the status for debugging (reading device registers doesn't quite do it, since the active bit won't show up if the interrupt is masked). At that point, is it more offensive to make it read-only even though it would be trivial to make it read/write (which would allow users who don't need it to bypass the routing API), or to make it read/write and live with there being more than one way to do something? KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of state, and because it doesn't allow debugging access to device registers (e.g. inspecting from the QEMU command line), and because it's hard to add new pieces of state if we realize we left something out. It reminds be of GET/SET_SREGS. With that, I did what you seem to want here, which is to adapt the existing interfaces, using feature flags to control optional state. It ended up being a mess, and ONE_REG was introduced as a replacement. The device control API is the equivalent of ONE_REG for things other than vcpus. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote: On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? I have no idea. Is mapping from one integer linear space (GSIs) to real interrupts suitable for you? Are there constraints on what sort of GSI numbers I can choose (I now see in the code that KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is that documented? Don't think it is. It looks like the APIC implementation has default routes, where is that documented?)? In the code. Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? Thinking faster than typing? Not sure what you mean. I suppose I could write another implementation of the table management code for MPIC, though the placement of irqchip inside the route entry, rather than as an argument to KVM_IRQ_LINE, suggests the table is supposed to be global, not in the individual interrupt controller. Yes the table is global. It maps GSI (Global System Interrupt IIRC) (integer) to (irqchip,pin) pair. It looks like I'm going to have to do this anyway for irqfd, though that doesn't make the other uses of the device control api go away. Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the status for debugging (reading device registers doesn't quite do it, since the active bit won't show up if the interrupt is masked). At that point, is it more offensive to make it read-only even though it would be trivial to make it read/write (which would allow users who don't need it to bypass the routing API), or to make it read/write and live with there being more than one way to do something? Can't follow this sentence. KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of state, and because it doesn't allow debugging access to device registers (e.g. inspecting from the QEMU command line), and because it's hard to add new pieces of state if we realize we left something out. It reminds be of GET/SET_SREGS. With that, I did what you seem to want here, which is to adapt the existing interfaces, using feature flags to control optional state. It ended up being a mess, and ONE_REG was introduced as a replacement. The device control API is the equivalent of ONE_REG for things other than vcpus. -Scott - ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2? - Agree on poor extensibility of interface. Adding a reserved amount of space as padding and versioning such as has been done so far is not acceptable? - Debugging: why is reading entire register state not acceptable? Yes, its slow. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/20/2013 07:09:55 PM, Marcelo Tosatti wrote: On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote: On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote: This is probably a stupid question, but why the KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for your purposes? x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING. To start, the whole IRQ routing stuff is poorly documented. Am I supposed to make up GSI numbers and use the routing thing to associate them with real interrupts? I have no idea. Is mapping from one integer linear space (GSIs) to real interrupts suitable for you? I can live with it. Where does the code live to manage this table, and how APICy is it (looks like the answer is irq_comm.c, and very)? Thinking faster than typing? Not sure what you mean. Sorry... The code to manage the table lives in virt/kvm/irq_comm.c. That code is very APIC-specific and not currently in a state that invites sharing. It looks like I'm going to have to do this anyway for irqfd, though that doesn't make the other uses of the device control api go away. Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the status for debugging (reading device registers doesn't quite do it, since the active bit won't show up if the interrupt is masked). At that point, is it more offensive to make it read-only even though it would be trivial to make it read/write (which would allow users who don't need it to bypass the routing API), or to make it read/write and live with there being more than one way to do something? Can't follow this sentence. Suppose, for the sake of irqfd (and maillist tranquility) MPIC uses existing KVM_IRQ_LINE and such. Will there be objection to being able to use KVM_GET_DEVICE_ATTR to *get* the irq line status for debugging purposes (maybe also useful for migration)? If there's no objection to that, would there be any actual reason, beyond saving a few lines of glue code, to make it a read-only attribute? KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of state, and because it doesn't allow debugging access to device registers (e.g. inspecting from the QEMU command line), and because it's hard to add new pieces of state if we realize we left something out. It reminds be of GET/SET_SREGS. With that, I did what you seem to want here, which is to adapt the existing interfaces, using feature flags to control optional state. It ended up being a mess, and ONE_REG was introduced as a replacement. The device control API is the equivalent of ONE_REG for things other than vcpus. -Scott - ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2? Well, that's what KVM_SET_DEVICE_ATTR is. - Agree on poor extensibility of interface. Adding a reserved amount of space as padding and versioning such as has been done so far is not acceptable? That's exactly what we did with SREGS, and it got declared a mess and replaced with ONE_REG. I'm trying to learn from my mistakes. :-) - Debugging: why is reading entire register state not acceptable? Yes, its slow. For one, it's more work. The current way works by simulating a guest MMIO access. No blob format to design, implement, and keep compatible. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Mon, Feb 18, 2013 at 04:43:27PM -0600, Scott Wood wrote: On 02/15/2013 10:51:16 PM, Paul Mackerras wrote: The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Hooking up to the CPU's interrupt lines is implicit in creating an MPIC (and I'm fine with changing that), not in creating any device. I don't see how it's worse than being implicit in calling KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips). First, KVM_CREATE_IRQCHIP_ARGS specifies the overall architecture of the interrupt control subsystem, so yes it does allow for cascaded controllers. Secondly, the difference is that if you see a KVM_CREATE_IRQCHIP_ARGS call, you know that the vcpus' interrupt inputs will be driven by kernel code. If you see a KVM_CREATE_DEVICE call, you don't know that; they might be, or they might not be. You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? No. Accesses by the guest get handled in the kernel. Accesses in QEMU, including MSIs generated by virtio, get forwarded to the kernel. OK, I missed the path where that gets done, then. It would be the current task priority. I assume MPIC maintains a 16-bit map of the interrupt priorities in service, so that would need to be added. We don't maintain such a map in the emulation code. We have a Oh, so how do you handle EOI of nested interrupts? How do you know what to reset the CPU priority to in that case? per-CPU bitmap of the actual interrupt sources pending/active, which is another attribute that would need to be added in order to support migration on MPIC. Not really, that can be recomputed from the sources easily enough. Paul. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/19/2013 06:41:11 PM, Paul Mackerras wrote: On Mon, Feb 18, 2013 at 04:43:27PM -0600, Scott Wood wrote: On 02/15/2013 10:51:16 PM, Paul Mackerras wrote: The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Hooking up to the CPU's interrupt lines is implicit in creating an MPIC (and I'm fine with changing that), not in creating any device. I don't see how it's worse than being implicit in calling KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips). First, KVM_CREATE_IRQCHIP_ARGS specifies the overall architecture of the interrupt control subsystem, so yes it does allow for cascaded controllers. Fine, but in that case you're dealing with a new irqchip (or irqarch if you prefer) type number (or param). If you wanted to take that approach with cascaded MPICs (I wouldn't), you would create a new device type number. I don't see the difference here. Secondly, the difference is that if you see a KVM_CREATE_IRQCHIP_ARGS call, you know that the vcpus' interrupt inputs will be driven by kernel code. If you see a KVM_CREATE_DEVICE call, you don't know that; they might be, or they might not be. I just don't understand what you mean here. Nobody's suggesting that we make this assumption as soon as you see a KVM_CREATE_DEVICE call for any random device. It's specifically in the creation of a KVM_DEV_TYPE_FSL_MPIC_20 or KVM_DEV_TYPE_FSL_MPIC_42 that this assumption is currently made. I don't see how creation of one of those specific devices is any different from calling KVM_CREATE_IRQCHIP_ARGS in terms of the intent that can reasonably be inferred. You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? No. Accesses by the guest get handled in the kernel. Accesses in QEMU, including MSIs generated by virtio, get forwarded to the kernel. OK, I missed the path where that gets done, then. ppce500_pci has a memory region for e500-pci-bar0 which is an alias for the ccsr memory region. The mpic registers are a child of the ccsr memory region. When hw/kvm/mpic.c in QEMU sees an access to that mpic memory region, it forwards the access to the kernel via KVM_DEV_MPIC_GRP_REGISTER. It would be the current task priority. I assume MPIC maintains a 16-bit map of the interrupt priorities in service, so that would need to be added. We don't maintain such a map in the emulation code. We have a Oh, so how do you handle EOI of nested interrupts? We scan the in-service bitmap for pending interrupts, and choose the one that has the highest priority as the one that the EOI must be referring to. Just having a bitmap of priorities would only tell us the priority of the highest priority in-service interrupt, not which actual interrupt it is. How do you know what to reset the CPU priority to in that case? Once we identify the interrupt that is being EOId as above, we clear that bit and check again to see if there's a remaining interrupt whose priority is high enough to prevent a pending interrupt from being delivered. per-CPU bitmap of the actual interrupt sources pending/active, which is another attribute that would need to be added in order to support migration on MPIC. Not really, that can be recomputed from the sources easily enough. I'm skeptical. IPIs at least would be a problem, as would other multicast interrupts if we allowed that. How would we distinguish an interrupt that is pending from one that is in-service, just from the sources? In any case, it's a bit premature to discuss what we'd need for migration until QEMU itself can save/restore a normal QEMU openpic. When that time comes, attributes can be added for whatever extra state we need (if any) to extend that capability to in-kernel MPICs. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Mon, Feb 18, 2013 at 04:43:27PM -0600, Scott Wood wrote: On 02/15/2013 10:51:16 PM, Paul Mackerras wrote: The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Hooking up to the CPU's interrupt lines is implicit in creating an MPIC (and I'm fine with changing that), not in creating any device. I don't see how it's worse than being implicit in calling KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips). First, KVM_CREATE_IRQCHIP_ARGS specifies the overall architecture of the interrupt control subsystem, so yes it does allow for cascaded controllers. Secondly, the difference is that if you see a KVM_CREATE_IRQCHIP_ARGS call, you know that the vcpus' interrupt inputs will be driven by kernel code. If you see a KVM_CREATE_DEVICE call, you don't know that; they might be, or they might not be. You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? No. Accesses by the guest get handled in the kernel. Accesses in QEMU, including MSIs generated by virtio, get forwarded to the kernel. OK, I missed the path where that gets done, then. It would be the current task priority. I assume MPIC maintains a 16-bit map of the interrupt priorities in service, so that would need to be added. We don't maintain such a map in the emulation code. We have a Oh, so how do you handle EOI of nested interrupts? How do you know what to reset the CPU priority to in that case? per-CPU bitmap of the actual interrupt sources pending/active, which is another attribute that would need to be added in order to support migration on MPIC. Not really, that can be recomputed from the sources easily enough. Paul. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 10:51:16 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote: On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. You can have cascaded MPICs; I once had a powerbook that had one MPIC cascaded into another. OK. How is the explicit request made in this patchset? The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Hooking up to the CPU's interrupt lines is implicit in creating an MPIC (and I'm fine with changing that), not in creating any device. I don't see how it's worse than being implicit in calling KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips). The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... With 24 bits of interrupt source number, you don't have to flatten it very far. IPIs are handled separately and don't appear in the interrupt source space. They do need to appear in the interrupt source space if we want to inject or irqfd them. Most users won't want to do that, but we have had a customer directly assign IPIs (to communicate with an OS running on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't using them) and MPIC timers to a guest. How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? No. Accesses by the guest get handled in the kernel. Accesses in QEMU, including MSIs generated by virtio, get forwarded to the kernel. There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. There's enough space for MPIC to have 16 bits of vector and some flags. We don't need to overdesign this. I view anything other than passing the actual MPIC register values around as overdesign here, given that it is communication between hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the kernel side. interrupts awaiting service (or even in service, if different priorities). We have both current task priority (which is a user-set mask-by-priority register) and the priority of the highest-prio in-service interrupt -- which would current processor priorty be? Etc. It would be the current task priority. I assume MPIC maintains a 16-bit map of the interrupt priorities in service, so that would need to be added. We don't maintain such a map in the emulation code. We have a per-CPU bitmap of the actual interrupt sources pending/active, which is another attribute that would need to be added in order to support migration on MPIC. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 10:51:16 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote: On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. You can have cascaded MPICs; I once had a powerbook that had one MPIC cascaded into another. OK. How is the explicit request made in this patchset? The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Hooking up to the CPU's interrupt lines is implicit in creating an MPIC (and I'm fine with changing that), not in creating any device. I don't see how it's worse than being implicit in calling KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips). The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... With 24 bits of interrupt source number, you don't have to flatten it very far. IPIs are handled separately and don't appear in the interrupt source space. They do need to appear in the interrupt source space if we want to inject or irqfd them. Most users won't want to do that, but we have had a customer directly assign IPIs (to communicate with an OS running on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't using them) and MPIC timers to a guest. How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? No. Accesses by the guest get handled in the kernel. Accesses in QEMU, including MSIs generated by virtio, get forwarded to the kernel. There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. There's enough space for MPIC to have 16 bits of vector and some flags. We don't need to overdesign this. I view anything other than passing the actual MPIC register values around as overdesign here, given that it is communication between hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the kernel side. interrupts awaiting service (or even in service, if different priorities). We have both current task priority (which is a user-set mask-by-priority register) and the priority of the highest-prio in-service interrupt -- which would current processor priorty be? Etc. It would be the current task priority. I assume MPIC maintains a 16-bit map of the interrupt priorities in service, so that would need to be added. We don't maintain such a map in the emulation code. We have a per-CPU bitmap of the actual interrupt sources pending/active, which is another attribute that would need to be added in order to support migration on MPIC. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. +4.80 KVM_CREATE_IRQCHIP_ARGS + +Capability: KVM_CAP_IRQCHIP_ARGS +Architectures: ppc Why just ppc? +struct kvm_irq_sources { + __u32 irq; + __u32 nr_irqs; + __u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT 0 +#define KVM_IRQ_SERVER_MASK0xULL +#define KVM_IRQ_PRIORITY_SHIFT 32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE(1ULL 40) +#define KVM_IRQ_MASKED (1ULL 41) +#define KVM_IRQ_PENDING(1ULL 42) What does server mean? Do you mean laid out like this for XICS? Let's please have a clean separation between what is generic and what is implementation-specific. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, the only time userspace needs to use these is at initialization, when it does a SET_SOURCES to create the interrupt sources it wants the VM to have. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. +4.80 KVM_CREATE_IRQCHIP_ARGS + +Capability: KVM_CAP_IRQCHIP_ARGS +Architectures: ppc Why just ppc? Just because only PPC has code to handle it at this point. I would hope that ARM and others could pick it up. Maybe I should make it: +Architectures: all (but so far only implemented on ppc) or something. +struct kvm_irq_sources { +__u32 irq; +__u32 nr_irqs; +__u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. Hmmm, you're right. I suppose it will have to be a fixed-size buffer, which is generally less efficient, but since it's mainly for migration that probably doesn't matter. In fact, I could probably use the existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the `chip_id' field for `irq' and the `pad' field for `nr_irqs'. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT0 +#define KVM_IRQ_SERVER_MASK 0xULL +#define KVM_IRQ_PRIORITY_SHIFT 32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL 40) +#define KVM_IRQ_MASKED (1ULL 41) +#define KVM_IRQ_PENDING (1ULL 42) What does server mean? Do you mean laid out like this for XICS? Sorry, I should have made that destination rather than server. You're right, server is confusing, but it just means where the interrupt is sent to be handled. It has nothing particularly to do with server computers. Let's please have a clean separation between what is generic and what is implementation-specific. I believe that the interface is pretty cleanly generic - the model is a set of interrupt sources and some per-vcpu state, with priorities to decide which interrupts get delivered when. That describes the basics of just about any SMP-capable interrupt controller, including MPIC. MPIC would still need an extra interface for userspace to save and restore
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. I wasn't there either. It's fine to have discussions at such events, but it should not preclude others from having input, or better ideas from being considered afterward. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. In that case, the state it describes is insufficient for MPIC. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. How do MSIs get injected? BTW, do you have any plans regarding irqfd? +struct kvm_irq_sources { + __u32 irq; + __u32 nr_irqs; + __u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. Hmmm, you're right. I suppose it will have to be a fixed-size buffer, It doesn't need to be a fixed size buffer. You can still have pointers, but they need to be represented as a plain __u64 with users casting the pointer. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT 0 +#define KVM_IRQ_SERVER_MASK 0xULL +#define KVM_IRQ_PRIORITY_SHIFT32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL 40) +#define KVM_IRQ_MASKED(1ULL 41) +#define KVM_IRQ_PENDING (1ULL 42) What does server mean? Do you mean laid out like this for XICS? Sorry, I should have made that
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote: On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. Further, the magic means that you can only have one instance of the device, whereas you might want to model the interrupt controller architecture as several devices. You could do that using several device types, but then the interconnections between them would also be magic. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. I wasn't there either. It's fine to have discussions at such events, but it should not preclude others from having input, or better ideas from being considered afterward. Sure - I was just trying to fit in with the expressed wish of the maintainer. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. In that case, the state it describes is insufficient for MPIC. Yes, MPIC has other random stuff in it besides interrupt control, so that's not surprising. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. How is the explicit request made in this patchset? Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Yes. I am unconvinced that such an abstraction is well-advised (especially after seeing existing generic interfaces that are clearly APIC-oriented). This isn't like normal driver interfaces where we're abstracting away hardware differences to let generic code use a device. Userspace knows what kind of device it wants, and how it wants it to integrate with the rest of the emulated system. We'd have to go out of our way to apply the abstraction on *both* ends. What do we get from that other than a chance that the abstraction leaks? What significant code actually becomes common? kvm_set_irq() is just a thin wrapper around the ioctl. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. Why? The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) BTW, do you have any plans regarding irqfd? I'm going to look at that next. Likewise... We should probably coordinate our efforts so that at least the de-APICization of the code only has to get done once. What about interrupt controllers that allow multiple destinations? The destination can be an identifier for a group of vcpus, or even a bitmap -- that's why I made it 32 bits. So you can have single delivery, or be limited to 32 vcpus, or have to implement some destination ID allocation scheme (which is more state that needs to be accessible somehow). More than 256 priorities? Different levels of output (normal, critical, machine check)? Programmable vector numbers? Active high/low control? There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. (BTW, I think having more than 256 priorities would be insane - do you know of any actual example that does?) No, but hardware designers have been known to do insane things. The per-vcpu state isn't even part of this AFAICT. It's an XICS-specific ONE_REG -- which is fine, but all that's left of the generic API is the get/set sources which is an imperfect match to our per-IRQ state and it's not clear how an
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote: On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. You can have cascaded MPICs; I once had a powerbook that had one MPIC cascaded into another. How is the explicit request made in this patchset? The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Yes. I am unconvinced that such an abstraction is well-advised (especially after seeing existing generic interfaces that are clearly APIC-oriented). This isn't like normal driver interfaces where we're abstracting away hardware differences to let generic code use a device. Userspace knows what kind of device it wants, and how it wants it to integrate with the rest of the emulated system. We'd have to go out of our way to apply the abstraction on *both* ends. What do we get from that other than a chance that the abstraction leaks? What significant code actually becomes common? kvm_set_irq() is just a thin wrapper around the ioctl. I'd think there could be some code reduction in the live migration code, but I'd need a qemu hacker to chime in here. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. Why? The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... With 24 bits of interrupt source number, you don't have to flatten it very far. IPIs are handled separately and don't appear in the interrupt source space. How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? If that's so, then why bother with in-kernel emulation at all? Once you're back in userspace, it's just as fast to do the emulation there as in the kernel. There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. There's enough space for MPIC to have 16 bits of vector and some flags. We don't need to overdesign this. Yes, the names of the bitfields in the ICP state word are XICS-specific, but the concepts are pretty generic - current processor priority, identifier for interrupt awaiting service, pending IPI request priority, pending interrupt request priority. We don't have separate concepts of pending IPI request priority and pending interrupt request priority. There can be multiple Sorry, I meant pending interrupt request. You do have that, it's what you read from the interrupt acknowledge register. interrupts awaiting service (or even in service, if different priorities). We have both current task priority (which is a user-set
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. +4.80 KVM_CREATE_IRQCHIP_ARGS + +Capability: KVM_CAP_IRQCHIP_ARGS +Architectures: ppc Why just ppc? +struct kvm_irq_sources { + __u32 irq; + __u32 nr_irqs; + __u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT 0 +#define KVM_IRQ_SERVER_MASK0xULL +#define KVM_IRQ_PRIORITY_SHIFT 32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE(1ULL 40) +#define KVM_IRQ_MASKED (1ULL 41) +#define KVM_IRQ_PENDING(1ULL 42) What does server mean? Do you mean laid out like this for XICS? Let's please have a clean separation between what is generic and what is implementation-specific. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, the only time userspace needs to use these is at initialization, when it does a SET_SOURCES to create the interrupt sources it wants the VM to have. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. +4.80 KVM_CREATE_IRQCHIP_ARGS + +Capability: KVM_CAP_IRQCHIP_ARGS +Architectures: ppc Why just ppc? Just because only PPC has code to handle it at this point. I would hope that ARM and others could pick it up. Maybe I should make it: +Architectures: all (but so far only implemented on ppc) or something. +struct kvm_irq_sources { +__u32 irq; +__u32 nr_irqs; +__u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. Hmmm, you're right. I suppose it will have to be a fixed-size buffer, which is generally less efficient, but since it's mainly for migration that probably doesn't matter. In fact, I could probably use the existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the `chip_id' field for `irq' and the `pad' field for `nr_irqs'. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT0 +#define KVM_IRQ_SERVER_MASK 0xULL +#define KVM_IRQ_PRIORITY_SHIFT 32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL 40) +#define KVM_IRQ_MASKED (1ULL 41) +#define KVM_IRQ_PENDING (1ULL 42) What does server mean? Do you mean laid out like this for XICS? Sorry, I should have made that destination rather than server. You're right, server is confusing, but it just means where the interrupt is sent to be handled. It has nothing particularly to do with server computers. Let's please have a clean separation between what is generic and what is implementation-specific. I believe that the interface is pretty cleanly generic - the model is a set of interrupt sources and some per-vcpu state, with priorities to decide which interrupts get delivered when. That describes the basics of just about any SMP-capable interrupt controller, including MPIC. MPIC would still need an extra interface for userspace to save and restore
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. I wasn't there either. It's fine to have discussions at such events, but it should not preclude others from having input, or better ideas from being considered afterward. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. In that case, the state it describes is insufficient for MPIC. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. How do MSIs get injected? BTW, do you have any plans regarding irqfd? +struct kvm_irq_sources { + __u32 irq; + __u32 nr_irqs; + __u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. Hmmm, you're right. I suppose it will have to be a fixed-size buffer, It doesn't need to be a fixed size buffer. You can still have pointers, but they need to be represented as a plain __u64 with users casting the pointer. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT 0 +#define KVM_IRQ_SERVER_MASK 0xULL +#define KVM_IRQ_PRIORITY_SHIFT32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL 40) +#define KVM_IRQ_MASKED(1ULL 41) +#define KVM_IRQ_PENDING (1ULL 42) What does server mean? Do you mean laid out like this for XICS? Sorry, I should have made that
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote: On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. Further, the magic means that you can only have one instance of the device, whereas you might want to model the interrupt controller architecture as several devices. You could do that using several device types, but then the interconnections between them would also be magic. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. I wasn't there either. It's fine to have discussions at such events, but it should not preclude others from having input, or better ideas from being considered afterward. Sure - I was just trying to fit in with the expressed wish of the maintainer. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. In that case, the state it describes is insufficient for MPIC. Yes, MPIC has other random stuff in it besides interrupt control, so that's not surprising. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. How is the explicit request made in this patchset? Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Yes. I am unconvinced that such an abstraction is well-advised (especially after seeing existing generic interfaces that are clearly APIC-oriented). This isn't like normal driver interfaces where we're abstracting away hardware differences to let generic code use a device. Userspace knows what kind of device it wants, and how it wants it to integrate with the rest of the emulated system. We'd have to go out of our way to apply the abstraction on *both* ends. What do we get from that other than a chance that the abstraction leaks? What significant code actually becomes common? kvm_set_irq() is just a thin wrapper around the ioctl. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. Why? The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) BTW, do you have any plans regarding irqfd? I'm going to look at that next. Likewise... We should probably coordinate our efforts so that at least the de-APICization of the code only has to get done once. What about interrupt controllers that allow multiple destinations? The destination can be an identifier for a group of vcpus, or even a bitmap -- that's why I made it 32 bits. So you can have single delivery, or be limited to 32 vcpus, or have to implement some destination ID allocation scheme (which is more state that needs to be accessible somehow). More than 256 priorities? Different levels of output (normal, critical, machine check)? Programmable vector numbers? Active high/low control? There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. (BTW, I think having more than 256 priorities would be insane - do you know of any actual example that does?) No, but hardware designers have been known to do insane things. The per-vcpu state isn't even part of this AFAICT. It's an XICS-specific ONE_REG -- which is fine, but all that's left of the generic API is the get/set sources which is an imperfect match to our per-IRQ state and it's not clear how an
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote: On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. You can have cascaded MPICs; I once had a powerbook that had one MPIC cascaded into another. How is the explicit request made in this patchset? The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Yes. I am unconvinced that such an abstraction is well-advised (especially after seeing existing generic interfaces that are clearly APIC-oriented). This isn't like normal driver interfaces where we're abstracting away hardware differences to let generic code use a device. Userspace knows what kind of device it wants, and how it wants it to integrate with the rest of the emulated system. We'd have to go out of our way to apply the abstraction on *both* ends. What do we get from that other than a chance that the abstraction leaks? What significant code actually becomes common? kvm_set_irq() is just a thin wrapper around the ioctl. I'd think there could be some code reduction in the live migration code, but I'd need a qemu hacker to chime in here. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. Why? The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... With 24 bits of interrupt source number, you don't have to flatten it very far. IPIs are handled separately and don't appear in the interrupt source space. How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? If that's so, then why bother with in-kernel emulation at all? Once you're back in userspace, it's just as fast to do the emulation there as in the kernel. There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. There's enough space for MPIC to have 16 bits of vector and some flags. We don't need to overdesign this. Yes, the names of the bitfields in the ICP state word are XICS-specific, but the concepts are pretty generic - current processor priority, identifier for interrupt awaiting service, pending IPI request priority, pending interrupt request priority. We don't have separate concepts of pending IPI request priority and pending interrupt request priority. There can be multiple Sorry, I meant pending interrupt request. You do have that, it's what you read from the interrupt acknowledge register. interrupts awaiting service (or even in service, if different priorities). We have both current task priority (which is a user-set