Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-03-21 Thread Alexander Graf

On 25.02.2013, at 01:59, Paul Mackerras wrote:

 On Wed, Feb 20, 2013 at 04:58:54PM -0300, Marcelo Tosatti wrote:
 
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
 
 So, I see Scott already answered from the point of view of his MPIC
 emulation stuff, but I'll answer too from the point of view of my XICS
 emulation code.
 
 My understanding, possibly imperfect, is that in a real system the
 routing of GSIs to IOAPICs would either be hardwired or set up by the
 BIOS, described in ACPI tables, and not modified by the operating
 system.  Is that correct?  So my belief is that the GSI routing is
 fundamentally distinct from and handled differently from the routing
 of interrupts to CPUs, which is fully under the control of the OS.

It's a different layer. I guess there's really some confusion on names here :). 
I'm always confused when I read sources and you apparently get confused when 
you read about GSIs.

GSIs are an ACPI concept. It's not x86 specific, it's also not APIC specific. 
It's just a global name space for IRQs.

Imagine you have 2 MPICs in your system. But you only want to use a single 
token / numer to access any IRQ on any chip. That's where GSIs come into play. 
They map different irqchip IRQs onto a flat number space. To speak with x86 
names:

Virtualization perspective:

  QEMU - GSI - IOAPIC - LAPIC - CPU

Device perspective:

  Device irq line - IOAPIC - LAPIC - CPU


The IOAPIC is the piece of hardware that interrupt lines get attached to. You 
connect a pin on it to an irq pin of your device. That talks to the LAPIC to 
actually schedule interrupts on target CPUs. The LAPIC then fetches interrupts 
and pulls the CPU's interrupt line.

Of course, things are slightly more complicated in the x86 world, as everything 
behind the IOAPIC also carries a payload defining which pin actually got 
triggered, but you get the idea.

So really just consider GSIs as a global flat number space for irqchip pins.


Alex

 In the XICS model we have a set of interrupt sources, each identified
 by a 24-bit number.  Control operations on an interrupt source just
 identify the source by its number.  Thus the interrupt source number
 is like a GSI, but we don't need to map that to a different space
 (e.g. IOAPIC identifier and input number) in order to operate on it,
 we can just operate on it directly.
 
 Paul.
 --
 To unsubscribe from this list: send the line unsubscribe kvm-ppc in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-03-21 Thread Alexander Graf

On 25.02.2013, at 01:59, Paul Mackerras wrote:

 On Wed, Feb 20, 2013 at 04:58:54PM -0300, Marcelo Tosatti wrote:
 
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
 
 So, I see Scott already answered from the point of view of his MPIC
 emulation stuff, but I'll answer too from the point of view of my XICS
 emulation code.
 
 My understanding, possibly imperfect, is that in a real system the
 routing of GSIs to IOAPICs would either be hardwired or set up by the
 BIOS, described in ACPI tables, and not modified by the operating
 system.  Is that correct?  So my belief is that the GSI routing is
 fundamentally distinct from and handled differently from the routing
 of interrupts to CPUs, which is fully under the control of the OS.

It's a different layer. I guess there's really some confusion on names here :). 
I'm always confused when I read sources and you apparently get confused when 
you read about GSIs.

GSIs are an ACPI concept. It's not x86 specific, it's also not APIC specific. 
It's just a global name space for IRQs.

Imagine you have 2 MPICs in your system. But you only want to use a single 
token / numer to access any IRQ on any chip. That's where GSIs come into play. 
They map different irqchip IRQs onto a flat number space. To speak with x86 
names:

Virtualization perspective:

  QEMU - GSI - IOAPIC - LAPIC - CPU

Device perspective:

  Device irq line - IOAPIC - LAPIC - CPU


The IOAPIC is the piece of hardware that interrupt lines get attached to. You 
connect a pin on it to an irq pin of your device. That talks to the LAPIC to 
actually schedule interrupts on target CPUs. The LAPIC then fetches interrupts 
and pulls the CPU's interrupt line.

Of course, things are slightly more complicated in the x86 world, as everything 
behind the IOAPIC also carries a payload defining which pin actually got 
triggered, but you get the idea.

So really just consider GSIs as a global flat number space for irqchip pins.


Alex

 In the XICS model we have a set of interrupt sources, each identified
 by a 24-bit number.  Control operations on an interrupt source just
 identify the source by its number.  Thus the interrupt source number
 is like a GSI, but we don't need to map that to a different space
 (e.g. IOAPIC identifier and input number) in order to operate on it,
 we can just operate on it directly.
 
 Paul.
 --
 To unsubscribe from this list: send the line unsubscribe kvm-ppc in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-24 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
 On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
 
 To start, the whole IRQ routing stuff is poorly documented.
 
 Am I supposed to make up GSI numbers and use the routing thing to
 associate them with real interrupts?
You can consider GSI to be a cookie that you use to refer to whatever
data you've put into routing table by KVM_IRQ_LINE/irqfd interface.
Even on x86, when irq routing is used to inject MSI interrupt, this is
exactly how GSI is used. In MSI case it does not have a meaning besides
look at that interrupt entry to see what MSI should be injected.

   Are there constraints on what
 sort of GSI numbers I can choose (I now see in the code that
 KVM_MAX_IRQ_ROUTES is returned from the capability check, but where
 is that documented?
The only constrain is that the number should be smalled than
KVM_MAX_IRQ_ROUTES, but this is implementation detail. Current
implementation uses array to map from GSI to a data, if a lot more
entries then currently allowed is needed implementation may be changed
to different data structure.

  It looks like the APIC implementation has
 default routes, where is that documented?)?
It is very PC centric, should not be even compiled for other arches.

  Where does the code
 live to manage this table, and how APICy is it (looks like the
 answer is irq_comm.c, and very)?
It is a mistake to refer to the irq routing table as APICy :). It is
certainly PC centric currently, but there is at least one HW layer
between it and the APIC. PC has global GSI space, each GSI can be
delivered via different irq chip. Some GSIs can be delivered through
multiple irq chips. Irq routing table provides mapping between GSI and
irq chips it should be delivered through. Some irq chips deliver
interrupt via APIC some not, but this is different story.

The work is needed to make the code not PC centric, but it should not be
a lot of work.

 I suppose I could write another
 implementation of the table management code for MPIC, though the
 placement of irqchip inside the route entry, rather than as an
 argument to KVM_IRQ_LINE, suggests the table is supposed to be
 global, not in the individual interrupt controller.
 
Yes, it is global. It sits between emulated devices and irq chips.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-24 Thread Paul Mackerras
On Wed, Feb 20, 2013 at 04:58:54PM -0300, Marcelo Tosatti wrote:

 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.

So, I see Scott already answered from the point of view of his MPIC
emulation stuff, but I'll answer too from the point of view of my XICS
emulation code.

My understanding, possibly imperfect, is that in a real system the
routing of GSIs to IOAPICs would either be hardwired or set up by the
BIOS, described in ACPI tables, and not modified by the operating
system.  Is that correct?  So my belief is that the GSI routing is
fundamentally distinct from and handled differently from the routing
of interrupts to CPUs, which is fully under the control of the OS.

In the XICS model we have a set of interrupt sources, each identified
by a 24-bit number.  Control operations on an interrupt source just
identify the source by its number.  Thus the interrupt source number
is like a GSI, but we don't need to map that to a different space
(e.g. IOAPIC identifier and input number) in order to operate on it,
we can just operate on it directly.

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-24 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
 On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
 
 To start, the whole IRQ routing stuff is poorly documented.
 
 Am I supposed to make up GSI numbers and use the routing thing to
 associate them with real interrupts?
You can consider GSI to be a cookie that you use to refer to whatever
data you've put into routing table by KVM_IRQ_LINE/irqfd interface.
Even on x86, when irq routing is used to inject MSI interrupt, this is
exactly how GSI is used. In MSI case it does not have a meaning besides
look at that interrupt entry to see what MSI should be injected.

   Are there constraints on what
 sort of GSI numbers I can choose (I now see in the code that
 KVM_MAX_IRQ_ROUTES is returned from the capability check, but where
 is that documented?
The only constrain is that the number should be smalled than
KVM_MAX_IRQ_ROUTES, but this is implementation detail. Current
implementation uses array to map from GSI to a data, if a lot more
entries then currently allowed is needed implementation may be changed
to different data structure.

  It looks like the APIC implementation has
 default routes, where is that documented?)?
It is very PC centric, should not be even compiled for other arches.

  Where does the code
 live to manage this table, and how APICy is it (looks like the
 answer is irq_comm.c, and very)?
It is a mistake to refer to the irq routing table as APICy :). It is
certainly PC centric currently, but there is at least one HW layer
between it and the APIC. PC has global GSI space, each GSI can be
delivered via different irq chip. Some GSIs can be delivered through
multiple irq chips. Irq routing table provides mapping between GSI and
irq chips it should be delivered through. Some irq chips deliver
interrupt via APIC some not, but this is different story.

The work is needed to make the code not PC centric, but it should not be
a lot of work.

 I suppose I could write another
 implementation of the table management code for MPIC, though the
 placement of irqchip inside the route entry, rather than as an
 argument to KVM_IRQ_LINE, suggests the table is supposed to be
 global, not in the individual interrupt controller.
 
Yes, it is global. It sits between emulated devices and irq chips.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Marcelo Tosatti
On Sat, Feb 16, 2013 at 01:56:14PM +1100, Paul Mackerras wrote:
 On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
  On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
  On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
   On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
   From: Benjamin Herrenschmidt b...@kernel.crashing.org
   
   This adds in-kernel emulation of the XICS (eXternal Interrupt
   Controller Specification) interrupt controller specified by
  PAPR, for
   both HV and PR KVM guests.
   
   This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
   KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
   should use in-kernel interrupt controller emulation, but also
  takes an
   argument struct that contains the type of interrupt controller
   architecture and an optional parameter.  Currently only one
  type value
   is defined, that which indicates the XICS architecture.
  
   Would the device config API I posted a couple days ago work for you?
  
  I suppose it could be made to work.  It doesn't feel like a natural
  fit though, because your API seems to assume (AFAICT) that a device is
  manipulated via registers at specific physical addresses, so I would
  have to invent an artificial set of registers with addresses and bit
  layouts, that aren't otherwise required.  The XICS is operated from
  the guest side via hcalls, not via emulated MMIO.
  
  I don't think it makes such an assumption.  The MPIC device has
  physical registers, so it exposes them, but it also exposes things
  that are not physical registers (e.g. the per-IRQ input state).  The
  generic device control layer leaves interpretation of attributes up
  to the device.
  
  I think it would be easier to fit XICS into the device control api
  model than to fit MPIC into this model, not to mention what would
  happen if we later want to emulate some other type of device -- x86
  already has at least one non-irqchip emulated device (i8254).
 
 I have no particular objection to the device control API per se, but
 I have two objections to using it as the primary interface to the XICS
 emulation.
 
 First, I dislike the magical side-effect where creating a device of a
 particular type (e.g. MPIC or XICS) automatically attaches it to the
 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control. 

This is probably a stupid question, but why the
KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
your purposes?

x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.

  Further, the magic means that you can
 only have one instance of the device, whereas you might want to model
 the interrupt controller architecture as several devices.  You could
 do that using several device types, but then the interconnections
 between them would also be magic.
 
 Secondly, it means that we are completely abandoning any attempt to
 define an abstract or generic interface to in-kernel interrupt
 controller emulations.  Each device will have its own unique set of
 attribute groups and its own unique userspace code to drive it, with
 no commonality between them.

snip

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Scott Wood

On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:

This is probably a stupid question, but why the
KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
your purposes?

x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.


To start, the whole IRQ routing stuff is poorly documented.

Am I supposed to make up GSI numbers and use the routing thing to  
associate them with real interrupts?  Are there constraints on what  
sort of GSI numbers I can choose (I now see in the code that  
KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is  
that documented?  It looks like the APIC implementation has default  
routes, where is that documented?)?  Where does the code live to manage  
this table, and how APICy is it (looks like the answer is irq_comm.c,  
and very)?  I suppose I could write another implementation of the  
table management code for MPIC, though the placement of irqchip  
inside the route entry, rather than as an argument to KVM_IRQ_LINE,  
suggests the table is supposed to be global, not in the individual  
interrupt controller.


It looks like I'm going to have to do this anyway for irqfd, though  
that doesn't make the other uses of the device control api go away.   
Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the  
status for debugging (reading device registers doesn't quite do it,  
since the active bit won't show up if the interrupt is masked).  At  
that point, is it more offensive to make it read-only even though it  
would be trivial to make it read/write (which would allow users who  
don't need it to bypass the routing API), or to make it read/write and  
live with there being more than one way to do something?


KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of  
state, and because it doesn't allow debugging access to device  
registers (e.g. inspecting from the QEMU command line), and because  
it's hard to add new pieces of state if we realize we left something  
out.  It reminds be of GET/SET_SREGS.  With that, I did what you seem  
to want here, which is to adapt the existing interfaces, using feature  
flags to control optional state.  It ended up being a mess, and ONE_REG  
was introduced as a replacement.  The device control API is the  
equivalent of ONE_REG for things other than vcpus.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
 On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
 
 To start, the whole IRQ routing stuff is poorly documented.
 
 Am I supposed to make up GSI numbers and use the routing thing to
 associate them with real interrupts? 

I have no idea. Is mapping from one integer linear space (GSIs) 
to real interrupts suitable for you? 

 Are there constraints on what
 sort of GSI numbers I can choose (I now see in the code that
 KVM_MAX_IRQ_ROUTES is returned from the capability check, but where
 is that documented?  

Don't think it is.

 It looks like the APIC implementation has
 default routes, where is that documented?)? 

In the code.

 Where does the code live to manage this table, and how APICy is it (looks 
 like the
 answer is irq_comm.c, and very)? 

Thinking faster than typing? Not sure what you mean.

 I suppose I could write another
 implementation of the table management code for MPIC, though the
 placement of irqchip inside the route entry, rather than as an
 argument to KVM_IRQ_LINE, suggests the table is supposed to be
 global, not in the individual interrupt controller.

Yes the table is global. It maps GSI (Global System Interrupt IIRC)
(integer) to (irqchip,pin) pair.

 It looks like I'm going to have to do this anyway for irqfd, though
 that doesn't make the other uses of the device control api go away.
 Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
 the status for debugging (reading device registers doesn't quite do
 it, since the active bit won't show up if the interrupt is
 masked).  

 At that point, is it more offensive to make it read-only
 even though it would be trivial to make it read/write (which would
 allow users who don't need it to bypass the routing API), or to make
 it read/write and live with there being more than one way to do
 something?

Can't follow this sentence.

 KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
 of state, and because it doesn't allow debugging access to device
 registers (e.g. inspecting from the QEMU command line), and because
 it's hard to add new pieces of state if we realize we left something
 out.  It reminds be of GET/SET_SREGS.  With that, I did what you
 seem to want here, which is to adapt the existing interfaces, using
 feature flags to control optional state.  It ended up being a mess,
 and ONE_REG was introduced as a replacement.  The device control API
 is the equivalent of ONE_REG for things other than vcpus.
 
 -Scott

- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?
- Agree on poor extensibility of interface. Adding a reserved amount
of space as padding and versioning such as has been done so far 
is not acceptable? 
- Debugging: why is reading entire register state not acceptable? Yes,
  its slow.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Scott Wood

On 02/20/2013 07:09:55 PM, Marcelo Tosatti wrote:

On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
 On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate  
for

 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.

 To start, the whole IRQ routing stuff is poorly documented.

 Am I supposed to make up GSI numbers and use the routing thing to
 associate them with real interrupts?

I have no idea. Is mapping from one integer linear space (GSIs)
to real interrupts suitable for you?


I can live with it.

 Where does the code live to manage this table, and how APICy is it  
(looks like the

 answer is irq_comm.c, and very)?

Thinking faster than typing? Not sure what you mean.


Sorry...  The code to manage the table lives in virt/kvm/irq_comm.c.   
That code is very APIC-specific and not currently in a state that  
invites sharing.



 It looks like I'm going to have to do this anyway for irqfd, though
 that doesn't make the other uses of the device control api go away.
 Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
 the status for debugging (reading device registers doesn't quite do
 it, since the active bit won't show up if the interrupt is
 masked).

 At that point, is it more offensive to make it read-only
 even though it would be trivial to make it read/write (which would
 allow users who don't need it to bypass the routing API), or to make
 it read/write and live with there being more than one way to do
 something?

Can't follow this sentence.


Suppose, for the sake of irqfd (and maillist tranquility) MPIC uses  
existing KVM_IRQ_LINE and such.  Will there be objection to being able  
to use KVM_GET_DEVICE_ATTR to *get* the irq line status for debugging  
purposes (maybe also useful for migration)?  If there's no objection to  
that, would there be any actual reason, beyond saving a few lines of  
glue code, to make it a read-only attribute?



 KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
 of state, and because it doesn't allow debugging access to device
 registers (e.g. inspecting from the QEMU command line), and because
 it's hard to add new pieces of state if we realize we left something
 out.  It reminds be of GET/SET_SREGS.  With that, I did what you
 seem to want here, which is to adapt the existing interfaces, using
 feature flags to control optional state.  It ended up being a mess,
 and ONE_REG was introduced as a replacement.  The device control API
 is the equivalent of ONE_REG for things other than vcpus.

 -Scott

- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?


Well, that's what KVM_SET_DEVICE_ATTR is.


- Agree on poor extensibility of interface. Adding a reserved amount
of space as padding and versioning such as has been done so far
is not acceptable?


That's exactly what we did with SREGS, and it got declared a mess and  
replaced with ONE_REG.  I'm trying to learn from my mistakes. :-)



- Debugging: why is reading entire register state not acceptable? Yes,
  its slow.


For one, it's more work.  The current way works by simulating a guest  
MMIO access.  No blob format to design, implement, and keep compatible.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Scott Wood

On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:

This is probably a stupid question, but why the
KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
your purposes?

x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.


To start, the whole IRQ routing stuff is poorly documented.

Am I supposed to make up GSI numbers and use the routing thing to  
associate them with real interrupts?  Are there constraints on what  
sort of GSI numbers I can choose (I now see in the code that  
KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is  
that documented?  It looks like the APIC implementation has default  
routes, where is that documented?)?  Where does the code live to manage  
this table, and how APICy is it (looks like the answer is irq_comm.c,  
and very)?  I suppose I could write another implementation of the  
table management code for MPIC, though the placement of irqchip  
inside the route entry, rather than as an argument to KVM_IRQ_LINE,  
suggests the table is supposed to be global, not in the individual  
interrupt controller.


It looks like I'm going to have to do this anyway for irqfd, though  
that doesn't make the other uses of the device control api go away.   
Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the  
status for debugging (reading device registers doesn't quite do it,  
since the active bit won't show up if the interrupt is masked).  At  
that point, is it more offensive to make it read-only even though it  
would be trivial to make it read/write (which would allow users who  
don't need it to bypass the routing API), or to make it read/write and  
live with there being more than one way to do something?


KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of  
state, and because it doesn't allow debugging access to device  
registers (e.g. inspecting from the QEMU command line), and because  
it's hard to add new pieces of state if we realize we left something  
out.  It reminds be of GET/SET_SREGS.  With that, I did what you seem  
to want here, which is to adapt the existing interfaces, using feature  
flags to control optional state.  It ended up being a mess, and ONE_REG  
was introduced as a replacement.  The device control API is the  
equivalent of ONE_REG for things other than vcpus.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
 On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
 
 To start, the whole IRQ routing stuff is poorly documented.
 
 Am I supposed to make up GSI numbers and use the routing thing to
 associate them with real interrupts? 

I have no idea. Is mapping from one integer linear space (GSIs) 
to real interrupts suitable for you? 

 Are there constraints on what
 sort of GSI numbers I can choose (I now see in the code that
 KVM_MAX_IRQ_ROUTES is returned from the capability check, but where
 is that documented?  

Don't think it is.

 It looks like the APIC implementation has
 default routes, where is that documented?)? 

In the code.

 Where does the code live to manage this table, and how APICy is it (looks 
 like the
 answer is irq_comm.c, and very)? 

Thinking faster than typing? Not sure what you mean.

 I suppose I could write another
 implementation of the table management code for MPIC, though the
 placement of irqchip inside the route entry, rather than as an
 argument to KVM_IRQ_LINE, suggests the table is supposed to be
 global, not in the individual interrupt controller.

Yes the table is global. It maps GSI (Global System Interrupt IIRC)
(integer) to (irqchip,pin) pair.

 It looks like I'm going to have to do this anyway for irqfd, though
 that doesn't make the other uses of the device control api go away.
 Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
 the status for debugging (reading device registers doesn't quite do
 it, since the active bit won't show up if the interrupt is
 masked).  

 At that point, is it more offensive to make it read-only
 even though it would be trivial to make it read/write (which would
 allow users who don't need it to bypass the routing API), or to make
 it read/write and live with there being more than one way to do
 something?

Can't follow this sentence.

 KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
 of state, and because it doesn't allow debugging access to device
 registers (e.g. inspecting from the QEMU command line), and because
 it's hard to add new pieces of state if we realize we left something
 out.  It reminds be of GET/SET_SREGS.  With that, I did what you
 seem to want here, which is to adapt the existing interfaces, using
 feature flags to control optional state.  It ended up being a mess,
 and ONE_REG was introduced as a replacement.  The device control API
 is the equivalent of ONE_REG for things other than vcpus.
 
 -Scott

- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?
- Agree on poor extensibility of interface. Adding a reserved amount
of space as padding and versioning such as has been done so far 
is not acceptable? 
- Debugging: why is reading entire register state not acceptable? Yes,
  its slow.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Scott Wood

On 02/20/2013 07:09:55 PM, Marcelo Tosatti wrote:

On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
 On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
 This is probably a stupid question, but why the
 KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate  
for

 your purposes?
 
 x86 sets up a default GSI-IRQCHIP PIN mapping on creation (during
 KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.

 To start, the whole IRQ routing stuff is poorly documented.

 Am I supposed to make up GSI numbers and use the routing thing to
 associate them with real interrupts?

I have no idea. Is mapping from one integer linear space (GSIs)
to real interrupts suitable for you?


I can live with it.

 Where does the code live to manage this table, and how APICy is it  
(looks like the

 answer is irq_comm.c, and very)?

Thinking faster than typing? Not sure what you mean.


Sorry...  The code to manage the table lives in virt/kvm/irq_comm.c.   
That code is very APIC-specific and not currently in a state that  
invites sharing.



 It looks like I'm going to have to do this anyway for irqfd, though
 that doesn't make the other uses of the device control api go away.
 Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
 the status for debugging (reading device registers doesn't quite do
 it, since the active bit won't show up if the interrupt is
 masked).

 At that point, is it more offensive to make it read-only
 even though it would be trivial to make it read/write (which would
 allow users who don't need it to bypass the routing API), or to make
 it read/write and live with there being more than one way to do
 something?

Can't follow this sentence.


Suppose, for the sake of irqfd (and maillist tranquility) MPIC uses  
existing KVM_IRQ_LINE and such.  Will there be objection to being able  
to use KVM_GET_DEVICE_ATTR to *get* the irq line status for debugging  
purposes (maybe also useful for migration)?  If there's no objection to  
that, would there be any actual reason, beyond saving a few lines of  
glue code, to make it a read-only attribute?



 KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
 of state, and because it doesn't allow debugging access to device
 registers (e.g. inspecting from the QEMU command line), and because
 it's hard to add new pieces of state if we realize we left something
 out.  It reminds be of GET/SET_SREGS.  With that, I did what you
 seem to want here, which is to adapt the existing interfaces, using
 feature flags to control optional state.  It ended up being a mess,
 and ONE_REG was introduced as a replacement.  The device control API
 is the equivalent of ONE_REG for things other than vcpus.

 -Scott

- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?


Well, that's what KVM_SET_DEVICE_ATTR is.


- Agree on poor extensibility of interface. Adding a reserved amount
of space as padding and versioning such as has been done so far
is not acceptable?


That's exactly what we did with SREGS, and it got declared a mess and  
replaced with ONE_REG.  I'm trying to learn from my mistakes. :-)



- Debugging: why is reading entire register state not acceptable? Yes,
  its slow.


For one, it's more work.  The current way works by simulating a guest  
MMIO access.  No blob format to design, implement, and keep compatible.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-19 Thread Paul Mackerras
On Mon, Feb 18, 2013 at 04:43:27PM -0600, Scott Wood wrote:
 On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:
 The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
 specific interrupt controller architecture connected to the vcpus'
 external interrupt inputs.  In that sense it's explicit, compared to a
 generic create device ioctl that could be for any device.
 
 Hooking up to the CPU's interrupt lines is implicit in creating an
 MPIC (and I'm fine with changing that), not in creating any device.
 I don't see how it's worse than being implicit in calling
 KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).

First, KVM_CREATE_IRQCHIP_ARGS specifies the overall architecture of
the interrupt control subsystem, so yes it does allow for cascaded
controllers.

Secondly, the difference is that if you see a KVM_CREATE_IRQCHIP_ARGS
call, you know that the vcpus' interrupt inputs will be driven by
kernel code.  If you see a KVM_CREATE_DEVICE call, you don't know
that; they might be, or they might not be.

 You're doing a round trip to userspace for every MPIC register access
 by the guest?  Seriously?
 
 No.  Accesses by the guest get handled in the kernel.  Accesses in
 QEMU, including MSIs generated by virtio, get forwarded to the
 kernel.

OK, I missed the path where that gets done, then.

 It would be the current task priority.  I assume MPIC maintains a
 16-bit map of the interrupt priorities in service, so that would need
 to be added.
 
 We don't maintain such a map in the emulation code.  We have a

Oh, so how do you handle EOI of nested interrupts?  How do you know
what to reset the CPU priority to in that case?

 per-CPU bitmap of the actual interrupt sources pending/active, which
 is another attribute that would need to be added in order to support
 migration on MPIC.

Not really, that can be recomputed from the sources easily enough.

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-19 Thread Scott Wood

On 02/19/2013 06:41:11 PM, Paul Mackerras wrote:

On Mon, Feb 18, 2013 at 04:43:27PM -0600, Scott Wood wrote:
 On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:
 The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
 specific interrupt controller architecture connected to the vcpus'
 external interrupt inputs.  In that sense it's explicit, compared  
to a

 generic create device ioctl that could be for any device.

 Hooking up to the CPU's interrupt lines is implicit in creating an
 MPIC (and I'm fine with changing that), not in creating any device.
 I don't see how it's worse than being implicit in calling
 KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).

First, KVM_CREATE_IRQCHIP_ARGS specifies the overall architecture of
the interrupt control subsystem, so yes it does allow for cascaded
controllers.


Fine, but in that case you're dealing with a new irqchip (or irqarch if  
you prefer) type number (or param).  If you wanted to take that  
approach with cascaded MPICs (I wouldn't), you would create a new  
device type number.  I don't see the difference here.



Secondly, the difference is that if you see a KVM_CREATE_IRQCHIP_ARGS
call, you know that the vcpus' interrupt inputs will be driven by
kernel code.  If you see a KVM_CREATE_DEVICE call, you don't know
that; they might be, or they might not be.


I just don't understand what you mean here.  Nobody's suggesting that  
we make this assumption as soon as you see a KVM_CREATE_DEVICE call  
for any random device.  It's specifically in the creation of a  
KVM_DEV_TYPE_FSL_MPIC_20 or KVM_DEV_TYPE_FSL_MPIC_42 that this  
assumption is currently made.  I don't see how creation of one of those  
specific devices is any different from calling KVM_CREATE_IRQCHIP_ARGS  
in terms of the intent that can reasonably be inferred.


 You're doing a round trip to userspace for every MPIC register  
access

 by the guest?  Seriously?

 No.  Accesses by the guest get handled in the kernel.  Accesses in
 QEMU, including MSIs generated by virtio, get forwarded to the
 kernel.

OK, I missed the path where that gets done, then.


ppce500_pci has a memory region for e500-pci-bar0 which is an alias for  
the ccsr memory region.  The mpic registers are a child of the ccsr  
memory region.  When hw/kvm/mpic.c in QEMU sees an access to that mpic  
memory region, it forwards the access to the kernel via  
KVM_DEV_MPIC_GRP_REGISTER.



 It would be the current task priority.  I assume MPIC maintains a
 16-bit map of the interrupt priorities in service, so that would  
need

 to be added.

 We don't maintain such a map in the emulation code.  We have a

Oh, so how do you handle EOI of nested interrupts?


We scan the in-service bitmap for pending interrupts, and choose the  
one that has the highest priority as the one that the EOI must be  
referring to.


Just having a bitmap of priorities would only tell us the priority of  
the highest priority in-service interrupt, not which actual interrupt  
it is.



 How do you know what to reset the CPU priority to in that case?


Once we identify the interrupt that is being EOId as above, we clear  
that bit and check again to see if there's a remaining interrupt whose  
priority is high enough to prevent a pending interrupt from being  
delivered.



 per-CPU bitmap of the actual interrupt sources pending/active, which
 is another attribute that would need to be added in order to support
 migration on MPIC.

Not really, that can be recomputed from the sources easily enough.


I'm skeptical.  IPIs at least would be a problem, as would other  
multicast interrupts if we allowed that.  How would we distinguish an  
interrupt that is pending from one that is in-service, just from the  
sources?


In any case, it's a bit premature to discuss what we'd need for  
migration until QEMU itself can save/restore a normal QEMU openpic.   
When that time comes, attributes can be added for whatever extra state  
we need (if any) to extend that capability to in-kernel MPICs.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-19 Thread Paul Mackerras
On Mon, Feb 18, 2013 at 04:43:27PM -0600, Scott Wood wrote:
 On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:
 The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
 specific interrupt controller architecture connected to the vcpus'
 external interrupt inputs.  In that sense it's explicit, compared to a
 generic create device ioctl that could be for any device.
 
 Hooking up to the CPU's interrupt lines is implicit in creating an
 MPIC (and I'm fine with changing that), not in creating any device.
 I don't see how it's worse than being implicit in calling
 KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).

First, KVM_CREATE_IRQCHIP_ARGS specifies the overall architecture of
the interrupt control subsystem, so yes it does allow for cascaded
controllers.

Secondly, the difference is that if you see a KVM_CREATE_IRQCHIP_ARGS
call, you know that the vcpus' interrupt inputs will be driven by
kernel code.  If you see a KVM_CREATE_DEVICE call, you don't know
that; they might be, or they might not be.

 You're doing a round trip to userspace for every MPIC register access
 by the guest?  Seriously?
 
 No.  Accesses by the guest get handled in the kernel.  Accesses in
 QEMU, including MSIs generated by virtio, get forwarded to the
 kernel.

OK, I missed the path where that gets done, then.

 It would be the current task priority.  I assume MPIC maintains a
 16-bit map of the interrupt priorities in service, so that would need
 to be added.
 
 We don't maintain such a map in the emulation code.  We have a

Oh, so how do you handle EOI of nested interrupts?  How do you know
what to reset the CPU priority to in that case?

 per-CPU bitmap of the actual interrupt sources pending/active, which
 is another attribute that would need to be added in order to support
 migration on MPIC.

Not really, that can be recomputed from the sources easily enough.

Paul.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-18 Thread Scott Wood

On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:

On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
 On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
 I have no particular objection to the device control API per se,  
but
 I have two objections to using it as the primary interface to the  
XICS

 emulation.
 
 First, I dislike the magical side-effect where creating a device  
of a
 particular type (e.g. MPIC or XICS) automatically attaches it to  
the

 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control.

 OK.  This is device-specific behavior, so you could define it
 differently for XICS than MPIC.  I suppose we could change it for
 MPIC as well, to leave an opening for the unlikely case where we'd
 want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.


OK.


 How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic create device ioctl that could be for any device.


Hooking up to the CPU's interrupt lines is implicit in creating an MPIC  
(and I'm fine with changing that), not in creating any device.  I don't  
see how it's worse than being implicit in calling  
KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).



 The standardized interface doesn't make things any easier (as noted
 above, the caller is already mpic-specific code), and we'd have to
 come up with a scheme for flattening our interrupt numberspace
 (rather than introduce new attribute groups for things like IPI and
 timer interrupts).  It may still be necessary when it comes to
 irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.


They do need to appear in the interrupt source space if we want to  
inject or irqfd them.  Most users won't want to do that, but we have  
had a customer directly assign IPIs (to communicate with an OS running  
on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't  
using them) and MPIC timers to a guest.



  How do MSIs get injected?
 
 Just like other interrupts - from the point of view of the  
interrupt

 controller they're edge-triggered interrupt sources.

 Ah right, I guess this is all set up via hcalls for XICS.

 With MPIC exposing its registers via the device control api,
 everything just works -- the PCI device generates a write to the
 MPIC's memory region, the QEMU MPIC stub sends the write to the
 kernel as for any other MMIO access (this passthrough is also useful
 for debugging), the in-kernel MPIC sees the write to the generate
 an MSI register and does its thing.  Compare that to all special
 the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?


No.  Accesses by the guest get handled in the kernel.  Accesses in  
QEMU, including MSIs generated by virtio, get forwarded to the kernel.



 There are plenty of bits free in the 64 bits per source that I have
 allowed.  We can accommodate those things.

 MPIC vector numbers take up 16 of the bits.  The architected
 interrupt level field is 8 bits, though only a handful of values are
 actually needed.  Add a couple binary flags, and it gets pretty
 tight if a third type of interrupt controller starts wanting
 something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.


I view anything other than passing the actual MPIC register values  
around as overdesign here, given that it is communication between  
hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the  
kernel side.



 interrupts awaiting service (or even in service, if different
 priorities).  We have both current task priority (which is a
 user-set mask-by-priority register) and the priority of the
 highest-prio in-service interrupt -- which would current processor
 priorty be?  Etc.

It would be the current task priority.  I assume MPIC maintains a
16-bit map of the interrupt priorities in service, so that would need
to be added.


We don't maintain such a map in the emulation code.  We have a per-CPU  
bitmap of the actual interrupt sources pending/active, which is another  
attribute that would need to be added in order to support migration on  
MPIC.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-18 Thread Scott Wood

On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:

On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
 On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
 I have no particular objection to the device control API per se,  
but
 I have two objections to using it as the primary interface to the  
XICS

 emulation.
 
 First, I dislike the magical side-effect where creating a device  
of a
 particular type (e.g. MPIC or XICS) automatically attaches it to  
the

 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control.

 OK.  This is device-specific behavior, so you could define it
 differently for XICS than MPIC.  I suppose we could change it for
 MPIC as well, to leave an opening for the unlikely case where we'd
 want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.


OK.


 How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic create device ioctl that could be for any device.


Hooking up to the CPU's interrupt lines is implicit in creating an MPIC  
(and I'm fine with changing that), not in creating any device.  I don't  
see how it's worse than being implicit in calling  
KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).



 The standardized interface doesn't make things any easier (as noted
 above, the caller is already mpic-specific code), and we'd have to
 come up with a scheme for flattening our interrupt numberspace
 (rather than introduce new attribute groups for things like IPI and
 timer interrupts).  It may still be necessary when it comes to
 irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.


They do need to appear in the interrupt source space if we want to  
inject or irqfd them.  Most users won't want to do that, but we have  
had a customer directly assign IPIs (to communicate with an OS running  
on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't  
using them) and MPIC timers to a guest.



  How do MSIs get injected?
 
 Just like other interrupts - from the point of view of the  
interrupt

 controller they're edge-triggered interrupt sources.

 Ah right, I guess this is all set up via hcalls for XICS.

 With MPIC exposing its registers via the device control api,
 everything just works -- the PCI device generates a write to the
 MPIC's memory region, the QEMU MPIC stub sends the write to the
 kernel as for any other MMIO access (this passthrough is also useful
 for debugging), the in-kernel MPIC sees the write to the generate
 an MSI register and does its thing.  Compare that to all special
 the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?


No.  Accesses by the guest get handled in the kernel.  Accesses in  
QEMU, including MSIs generated by virtio, get forwarded to the kernel.



 There are plenty of bits free in the 64 bits per source that I have
 allowed.  We can accommodate those things.

 MPIC vector numbers take up 16 of the bits.  The architected
 interrupt level field is 8 bits, though only a handful of values are
 actually needed.  Add a couple binary flags, and it gets pretty
 tight if a third type of interrupt controller starts wanting
 something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.


I view anything other than passing the actual MPIC register values  
around as overdesign here, given that it is communication between  
hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the  
kernel side.



 interrupts awaiting service (or even in service, if different
 priorities).  We have both current task priority (which is a
 user-set mask-by-priority register) and the priority of the
 highest-prio in-service interrupt -- which would current processor
 priorty be?  Etc.

It would be the current task priority.  I assume MPIC maintains a
16-bit map of the interrupt priorities in service, so that would need
to be added.


We don't maintain such a map in the emulation code.  We have a per-CPU  
bitmap of the actual interrupt sources pending/active, which is another  
attribute that would need to be added in order to support migration on  
MPIC.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:

From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
should use in-kernel interrupt controller emulation, but also takes an
argument struct that contains the type of interrupt controller
architecture and an optional parameter.  Currently only one type value
is defined, that which indicates the XICS architecture.


Would the device config API I posted a couple days ago work for you?


The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Two other new ioctls allow userspace to control the interrupt
sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
destination cpu, level/edge sensitivity and pending state of a range
of interrupt sources, creating them if they don't already exist.  The
KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
interrupt sources (they are required to already exist).


Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING
for interrupt injection, what if there's a race with the user changing
other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
being presented as a generic API.


+4.80 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc


Why just ppc?


+struct kvm_irq_sources {
+   __u32 irq;
+   __u32 nr_irqs;
+   __u64 __user *irqbuf;
+};


Please no pointers in UAPI -- this would require a compat wrapper with
32-bit user and 64-bit kernel.


+/* irqbuf entries are laid out like this: */
+#define KVM_IRQ_SERVER_SHIFT   0
+#define KVM_IRQ_SERVER_MASK0xULL
+#define KVM_IRQ_PRIORITY_SHIFT 32
+#define KVM_IRQ_PRIORITY_MASK  0xff
+#define KVM_IRQ_LEVEL_SENSITIVE(1ULL  40)
+#define KVM_IRQ_MASKED (1ULL  41)
+#define KVM_IRQ_PENDING(1ULL  42)


What does server mean?  Do you mean laid out like this for XICS?
Let's please have a clean separation between what is generic and what is
implementation-specific.

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
 On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 This adds in-kernel emulation of the XICS (eXternal Interrupt
 Controller Specification) interrupt controller specified by PAPR, for
 both HV and PR KVM guests.
 
 This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
 KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
 should use in-kernel interrupt controller emulation, but also takes an
 argument struct that contains the type of interrupt controller
 architecture and an optional parameter.  Currently only one type value
 is defined, that which indicates the XICS architecture.
 
 Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.

Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.

 The XICS emulation supports up to 1048560 interrupt sources.
 Interrupt source numbers below 16 are reserved; 0 is used to mean no
 interrupt and 2 is used for IPIs.  Internally these are represented in
 blocks of 1024, called ICS (interrupt controller source) entities, but
 that is not visible to userspace.
 
 Two other new ioctls allow userspace to control the interrupt
 sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
 destination cpu, level/edge sensitivity and pending state of a range
 of interrupt sources, creating them if they don't already exist.  The
 KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
 interrupt sources (they are required to already exist).
 
 Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING

These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.  We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration, the only time userspace needs to
use these is at initialization, when it does a SET_SOURCES to create
the interrupt sources it wants the VM to have.

 for interrupt injection, what if there's a race with the user changing
 other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
 being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.

 +4.80 KVM_CREATE_IRQCHIP_ARGS
 +
 +Capability: KVM_CAP_IRQCHIP_ARGS
 +Architectures: ppc
 
 Why just ppc?

Just because only PPC has code to handle it at this point.  I would
hope that ARM and others could pick it up.  Maybe I should make it:

+Architectures: all (but so far only implemented on ppc)

or something.

 +struct kvm_irq_sources {
 +__u32 irq;
 +__u32 nr_irqs;
 +__u64 __user *irqbuf;
 +};
 
 Please no pointers in UAPI -- this would require a compat wrapper with
 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,
which is generally less efficient, but since it's mainly for migration
that probably doesn't matter.  In fact, I could probably use the
existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the
`chip_id' field for `irq' and the `pad' field for `nr_irqs'.

 +/* irqbuf entries are laid out like this: */
 +#define KVM_IRQ_SERVER_SHIFT0
 +#define KVM_IRQ_SERVER_MASK 0xULL
 +#define KVM_IRQ_PRIORITY_SHIFT  32
 +#define KVM_IRQ_PRIORITY_MASK   0xff
 +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL  40)
 +#define KVM_IRQ_MASKED  (1ULL  41)
 +#define KVM_IRQ_PENDING (1ULL  42)
 
 What does server mean?  Do you mean laid out like this for XICS?

Sorry, I should have made that destination rather than server.
You're right, server is confusing, but it just means where the
interrupt is sent to be handled.  It has nothing particularly to do
with server computers.

 Let's please have a clean separation between what is generic and what is
 implementation-specific.

I believe that the interface is pretty cleanly generic - the model is
a set of interrupt sources and some per-vcpu state, with priorities to
decide which interrupts get delivered when.  That describes the basics
of just about any SMP-capable interrupt controller, including MPIC.

MPIC would still need an extra interface for userspace to save and
restore 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:

On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
 On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 This adds in-kernel emulation of the XICS (eXternal Interrupt
 Controller Specification) interrupt controller specified by PAPR,  
for

 both HV and PR KVM guests.
 
 This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
 KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
 should use in-kernel interrupt controller emulation, but also  
takes an

 argument struct that contains the type of interrupt controller
 architecture and an optional parameter.  Currently only one type  
value

 is defined, that which indicates the XICS architecture.

 Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.


I don't think it makes such an assumption.  The MPIC device has  
physical registers, so it exposes them, but it also exposes things that  
are not physical registers (e.g. the per-IRQ input state).  The generic  
device control layer leaves interpretation of attributes up to the  
device.


I think it would be easier to fit XICS into the device control api  
model than to fit MPIC into this model, not to mention what would  
happen if we later want to emulate some other type of device -- x86  
already has at least one non-irqchip emulated device (i8254).



Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.


I wasn't there either.  It's fine to have discussions at such events,  
but it should not preclude others from having input, or better ideas  
from being considered afterward.



 The XICS emulation supports up to 1048560 interrupt sources.
 Interrupt source numbers below 16 are reserved; 0 is used to mean  
no
 interrupt and 2 is used for IPIs.  Internally these are  
represented in
 blocks of 1024, called ICS (interrupt controller source) entities,  
but

 that is not visible to userspace.
 
 Two other new ioctls allow userspace to control the interrupt
 sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
 destination cpu, level/edge sensitivity and pending state of a  
range
 of interrupt sources, creating them if they don't already exist.   
The
 KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range  
of

 interrupt sources (they are required to already exist).

 Why is it userspace's job to control these?  If you use  
KVM_IRQ_PENDING


These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.


In that case, the state it describes is insufficient for MPIC.


We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration,


We don't yet, but would prefer not to assume that it'll never happen.

 for interrupt injection, what if there's a race with the user  
changing
 other flags via MMIO?  Maybe this isn't an issue with XICS, but  
this is

 being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.


OK, KVM_IRQ_LINE is still used for interrupt injection.  I was hoping  
to avoid going through a standardized interface that forces a global  
interrupt numberspace.


How do MSIs get injected?

BTW, do you have any plans regarding irqfd?


 +struct kvm_irq_sources {
 +  __u32 irq;
 +  __u32 nr_irqs;
 +  __u64 __user *irqbuf;
 +};

 Please no pointers in UAPI -- this would require a compat wrapper  
with

 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,


It doesn't need to be a fixed size buffer.  You can still have  
pointers, but they need to be represented as a plain __u64 with users  
casting the pointer.



 +/* irqbuf entries are laid out like this: */
 +#define KVM_IRQ_SERVER_SHIFT  0
 +#define KVM_IRQ_SERVER_MASK   0xULL
 +#define KVM_IRQ_PRIORITY_SHIFT32
 +#define KVM_IRQ_PRIORITY_MASK 0xff
 +#define KVM_IRQ_LEVEL_SENSITIVE   (1ULL  40)
 +#define KVM_IRQ_MASKED(1ULL  41)
 +#define KVM_IRQ_PENDING   (1ULL  42)

 What does server mean?  Do you mean laid out like this for XICS?

Sorry, I should have made that 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
 On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
 On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
  On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
  From: Benjamin Herrenschmidt b...@kernel.crashing.org
  
  This adds in-kernel emulation of the XICS (eXternal Interrupt
  Controller Specification) interrupt controller specified by
 PAPR, for
  both HV and PR KVM guests.
  
  This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
  KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
  should use in-kernel interrupt controller emulation, but also
 takes an
  argument struct that contains the type of interrupt controller
  architecture and an optional parameter.  Currently only one
 type value
  is defined, that which indicates the XICS architecture.
 
  Would the device config API I posted a couple days ago work for you?
 
 I suppose it could be made to work.  It doesn't feel like a natural
 fit though, because your API seems to assume (AFAICT) that a device is
 manipulated via registers at specific physical addresses, so I would
 have to invent an artificial set of registers with addresses and bit
 layouts, that aren't otherwise required.  The XICS is operated from
 the guest side via hcalls, not via emulated MMIO.
 
 I don't think it makes such an assumption.  The MPIC device has
 physical registers, so it exposes them, but it also exposes things
 that are not physical registers (e.g. the per-IRQ input state).  The
 generic device control layer leaves interpretation of attributes up
 to the device.
 
 I think it would be easier to fit XICS into the device control api
 model than to fit MPIC into this model, not to mention what would
 happen if we later want to emulate some other type of device -- x86
 already has at least one non-irqchip emulated device (i8254).

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.  Further, the magic means that you can
only have one instance of the device, whereas you might want to model
the interrupt controller architecture as several devices.  You could
do that using several device types, but then the interconnections
between them would also be magic.

Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.

 Part of the reason I went with the API that I did is that that was
 what was agreed on at KVM Forum (as far as I can tell, not having been
 at the meeting).  Your device API seems to be quite different to that.
 
 I wasn't there either.  It's fine to have discussions at such
 events, but it should not preclude others from having input, or
 better ideas from being considered afterward.

Sure - I was just trying to fit in with the expressed wish of the
maintainer.

  The XICS emulation supports up to 1048560 interrupt sources.
  Interrupt source numbers below 16 are reserved; 0 is used to
 mean no
  interrupt and 2 is used for IPIs.  Internally these are
 represented in
  blocks of 1024, called ICS (interrupt controller source)
 entities, but
  that is not visible to userspace.
  
  Two other new ioctls allow userspace to control the interrupt
  sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
  destination cpu, level/edge sensitivity and pending state of a
 range
  of interrupt sources, creating them if they don't already
 exist.  The
  KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a
 range of
  interrupt sources (they are required to already exist).
 
  Why is it userspace's job to control these?  If you use
 KVM_IRQ_PENDING
 
 These are primarily there to support live migration.  For live
 migration, userspace needs to be able to extract the entire state of
 the virtual machine from the old guest, and then set the new guest to
 that exact same state.
 
 In that case, the state it describes is insufficient for MPIC.

Yes, MPIC has other random stuff in it besides interrupt control, so
that's not surprising.

 We have live migration working in qemu for
 pSeries guests with in-kernel XICS emulation using this interface.
 If you're not doing live migration,
 
 We don't yet, but would prefer not to assume that it'll never happen.
 
  for interrupt injection, what if there's a race with the user
 changing
  other flags via MMIO?  Maybe this isn't an issue with XICS, but
 this is
  being presented as a generic API.
 
 They're not used while the guest is running, as I said, but even if
 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.


OK.  This is device-specific behavior, so you could define it  
differently for XICS than MPIC.  I suppose we could change it for MPIC  
as well, to leave an opening for the unlikely case where we'd want to  
model an MPIC that isn't directly connected to the CPUs.


How is the explicit request made in this patchset?


Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.


Yes.  I am unconvinced that such an abstraction is well-advised  
(especially after seeing existing generic interfaces that are clearly  
APIC-oriented).  This isn't like normal driver interfaces where we're  
abstracting away hardware differences to let generic code use a  
device.  Userspace knows what kind of device it wants, and how it wants  
it to integrate with the rest of the emulated system.  We'd have to go  
out of our way to apply the abstraction on *both* ends.  What do we get  
from that other than a chance that the abstraction leaks?  What  
significant code actually becomes common?  kvm_set_irq() is just a thin  
wrapper around the ioctl.



 We have live migration working in qemu for
 pSeries guests with in-kernel XICS emulation using this interface.
 If you're not doing live migration,

 We don't yet, but would prefer not to assume that it'll never  
happen.


  for interrupt injection, what if there's a race with the user
 changing
  other flags via MMIO?  Maybe this isn't an issue with XICS, but
 this is
  being presented as a generic API.
 
 They're not used while the guest is running, as I said, but even if
 they were, there is appropriate locking in there to handle any  
races.


 OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
 hoping to avoid going through a standardized interface that forces a
 global interrupt numberspace.

Why?


The standardized interface doesn't make things any easier (as noted  
above, the caller is already mpic-specific code), and we'd have to come  
up with a scheme for flattening our interrupt numberspace (rather than  
introduce new attribute groups for things like IPI and timer  
interrupts).  It may still be necessary when it comes to irqfd,  
though...



 How do MSIs get injected?

Just like other interrupts - from the point of view of the interrupt
controller they're edge-triggered interrupt sources.


Ah right, I guess this is all set up via hcalls for XICS.

With MPIC exposing its registers via the device control api, everything  
just works -- the PCI device generates a write to the MPIC's memory  
region, the QEMU MPIC stub sends the write to the kernel as for any  
other MMIO access (this passthrough is also useful for debugging), the  
in-kernel MPIC sees the write to the generate an MSI register and  
does its thing.  Compare that to all special the MSI code for APIC...  
:-)



 BTW, do you have any plans regarding irqfd?

I'm going to look at that next.


Likewise...  We should probably coordinate our efforts so that at least  
the de-APICization of the code only has to get done once.



 What about interrupt controllers that allow multiple destinations?

The destination can be an identifier for a group of vcpus, or even a
bitmap -- that's why I made it 32 bits.


So you can have single delivery, or be limited to 32 vcpus, or have to  
implement some destination ID allocation scheme (which is more state  
that needs to be accessible somehow).



 More than 256 priorities?  Different levels of output (normal,
 critical, machine check)?  Programmable vector numbers?  Active
 high/low control?

There are plenty of bits free in the 64 bits per source that I have
allowed.  We can accommodate those things.


MPIC vector numbers take up 16 of the bits.  The architected interrupt  
level field is 8 bits, though only a handful of values are actually  
needed.  Add a couple binary flags, and it gets pretty tight if a third  
type of interrupt controller starts wanting something new.



 (BTW, I think having more
than 256 priorities would be insane - do you know of any actual
example that does?)


No, but hardware designers have been known to do insane things.


 The per-vcpu state isn't even part of this AFAICT.  It's an
 XICS-specific ONE_REG -- which is fine, but all that's left of the
 generic API is the get/set sources which is an imperfect match to
 our per-IRQ state and it's not clear how an 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
 On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
 I have no particular objection to the device control API per se, but
 I have two objections to using it as the primary interface to the XICS
 emulation.
 
 First, I dislike the magical side-effect where creating a device of a
 particular type (e.g. MPIC or XICS) automatically attaches it to the
 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control.
 
 OK.  This is device-specific behavior, so you could define it
 differently for XICS than MPIC.  I suppose we could change it for
 MPIC as well, to leave an opening for the unlikely case where we'd
 want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.

 How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic create device ioctl that could be for any device.

 Secondly, it means that we are completely abandoning any attempt to
 define an abstract or generic interface to in-kernel interrupt
 controller emulations.  Each device will have its own unique set of
 attribute groups and its own unique userspace code to drive it, with
 no commonality between them.
 
 Yes.  I am unconvinced that such an abstraction is well-advised
 (especially after seeing existing generic interfaces that are
 clearly APIC-oriented).  This isn't like normal driver interfaces
 where we're abstracting away hardware differences to let generic
 code use a device.  Userspace knows what kind of device it wants,
 and how it wants it to integrate with the rest of the emulated
 system.  We'd have to go out of our way to apply the abstraction on
 *both* ends.  What do we get from that other than a chance that the
 abstraction leaks?  What significant code actually becomes common?
 kvm_set_irq() is just a thin wrapper around the ioctl.

I'd think there could be some code reduction in the live migration
code, but I'd need a qemu hacker to chime in here.

  OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
  hoping to avoid going through a standardized interface that forces a
  global interrupt numberspace.
 
 Why?
 
 The standardized interface doesn't make things any easier (as noted
 above, the caller is already mpic-specific code), and we'd have to
 come up with a scheme for flattening our interrupt numberspace
 (rather than introduce new attribute groups for things like IPI and
 timer interrupts).  It may still be necessary when it comes to
 irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.

  How do MSIs get injected?
 
 Just like other interrupts - from the point of view of the interrupt
 controller they're edge-triggered interrupt sources.
 
 Ah right, I guess this is all set up via hcalls for XICS.
 
 With MPIC exposing its registers via the device control api,
 everything just works -- the PCI device generates a write to the
 MPIC's memory region, the QEMU MPIC stub sends the write to the
 kernel as for any other MMIO access (this passthrough is also useful
 for debugging), the in-kernel MPIC sees the write to the generate
 an MSI register and does its thing.  Compare that to all special
 the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?  If that's so, then why bother with
in-kernel emulation at all?  Once you're back in userspace, it's just
as fast to do the emulation there as in the kernel.

 There are plenty of bits free in the 64 bits per source that I have
 allowed.  We can accommodate those things.
 
 MPIC vector numbers take up 16 of the bits.  The architected
 interrupt level field is 8 bits, though only a handful of values are
 actually needed.  Add a couple binary flags, and it gets pretty
 tight if a third type of interrupt controller starts wanting
 something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.

 Yes, the names of the bitfields in the ICP state word are
 XICS-specific, but the concepts are pretty generic - current processor
 priority, identifier for interrupt awaiting service, pending IPI
 request priority, pending interrupt request priority.
 
 We don't have separate concepts of pending IPI request priority
 and pending interrupt request priority.  There can be multiple

Sorry, I meant pending interrupt request.  You do have that, it's
what you read from the interrupt acknowledge register.

 interrupts awaiting service (or even in service, if different
 priorities).  We have both current task priority (which is a
 user-set 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:

From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
should use in-kernel interrupt controller emulation, but also takes an
argument struct that contains the type of interrupt controller
architecture and an optional parameter.  Currently only one type value
is defined, that which indicates the XICS architecture.


Would the device config API I posted a couple days ago work for you?


The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Two other new ioctls allow userspace to control the interrupt
sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
destination cpu, level/edge sensitivity and pending state of a range
of interrupt sources, creating them if they don't already exist.  The
KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
interrupt sources (they are required to already exist).


Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING
for interrupt injection, what if there's a race with the user changing
other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
being presented as a generic API.


+4.80 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc


Why just ppc?


+struct kvm_irq_sources {
+   __u32 irq;
+   __u32 nr_irqs;
+   __u64 __user *irqbuf;
+};


Please no pointers in UAPI -- this would require a compat wrapper with
32-bit user and 64-bit kernel.


+/* irqbuf entries are laid out like this: */
+#define KVM_IRQ_SERVER_SHIFT   0
+#define KVM_IRQ_SERVER_MASK0xULL
+#define KVM_IRQ_PRIORITY_SHIFT 32
+#define KVM_IRQ_PRIORITY_MASK  0xff
+#define KVM_IRQ_LEVEL_SENSITIVE(1ULL  40)
+#define KVM_IRQ_MASKED (1ULL  41)
+#define KVM_IRQ_PENDING(1ULL  42)


What does server mean?  Do you mean laid out like this for XICS?
Let's please have a clean separation between what is generic and what is
implementation-specific.

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
 On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 This adds in-kernel emulation of the XICS (eXternal Interrupt
 Controller Specification) interrupt controller specified by PAPR, for
 both HV and PR KVM guests.
 
 This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
 KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
 should use in-kernel interrupt controller emulation, but also takes an
 argument struct that contains the type of interrupt controller
 architecture and an optional parameter.  Currently only one type value
 is defined, that which indicates the XICS architecture.
 
 Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.

Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.

 The XICS emulation supports up to 1048560 interrupt sources.
 Interrupt source numbers below 16 are reserved; 0 is used to mean no
 interrupt and 2 is used for IPIs.  Internally these are represented in
 blocks of 1024, called ICS (interrupt controller source) entities, but
 that is not visible to userspace.
 
 Two other new ioctls allow userspace to control the interrupt
 sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
 destination cpu, level/edge sensitivity and pending state of a range
 of interrupt sources, creating them if they don't already exist.  The
 KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
 interrupt sources (they are required to already exist).
 
 Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING

These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.  We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration, the only time userspace needs to
use these is at initialization, when it does a SET_SOURCES to create
the interrupt sources it wants the VM to have.

 for interrupt injection, what if there's a race with the user changing
 other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
 being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.

 +4.80 KVM_CREATE_IRQCHIP_ARGS
 +
 +Capability: KVM_CAP_IRQCHIP_ARGS
 +Architectures: ppc
 
 Why just ppc?

Just because only PPC has code to handle it at this point.  I would
hope that ARM and others could pick it up.  Maybe I should make it:

+Architectures: all (but so far only implemented on ppc)

or something.

 +struct kvm_irq_sources {
 +__u32 irq;
 +__u32 nr_irqs;
 +__u64 __user *irqbuf;
 +};
 
 Please no pointers in UAPI -- this would require a compat wrapper with
 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,
which is generally less efficient, but since it's mainly for migration
that probably doesn't matter.  In fact, I could probably use the
existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the
`chip_id' field for `irq' and the `pad' field for `nr_irqs'.

 +/* irqbuf entries are laid out like this: */
 +#define KVM_IRQ_SERVER_SHIFT0
 +#define KVM_IRQ_SERVER_MASK 0xULL
 +#define KVM_IRQ_PRIORITY_SHIFT  32
 +#define KVM_IRQ_PRIORITY_MASK   0xff
 +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL  40)
 +#define KVM_IRQ_MASKED  (1ULL  41)
 +#define KVM_IRQ_PENDING (1ULL  42)
 
 What does server mean?  Do you mean laid out like this for XICS?

Sorry, I should have made that destination rather than server.
You're right, server is confusing, but it just means where the
interrupt is sent to be handled.  It has nothing particularly to do
with server computers.

 Let's please have a clean separation between what is generic and what is
 implementation-specific.

I believe that the interface is pretty cleanly generic - the model is
a set of interrupt sources and some per-vcpu state, with priorities to
decide which interrupts get delivered when.  That describes the basics
of just about any SMP-capable interrupt controller, including MPIC.

MPIC would still need an extra interface for userspace to save and
restore 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:

On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
 On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 This adds in-kernel emulation of the XICS (eXternal Interrupt
 Controller Specification) interrupt controller specified by PAPR,  
for

 both HV and PR KVM guests.
 
 This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
 KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
 should use in-kernel interrupt controller emulation, but also  
takes an

 argument struct that contains the type of interrupt controller
 architecture and an optional parameter.  Currently only one type  
value

 is defined, that which indicates the XICS architecture.

 Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.


I don't think it makes such an assumption.  The MPIC device has  
physical registers, so it exposes them, but it also exposes things that  
are not physical registers (e.g. the per-IRQ input state).  The generic  
device control layer leaves interpretation of attributes up to the  
device.


I think it would be easier to fit XICS into the device control api  
model than to fit MPIC into this model, not to mention what would  
happen if we later want to emulate some other type of device -- x86  
already has at least one non-irqchip emulated device (i8254).



Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.


I wasn't there either.  It's fine to have discussions at such events,  
but it should not preclude others from having input, or better ideas  
from being considered afterward.



 The XICS emulation supports up to 1048560 interrupt sources.
 Interrupt source numbers below 16 are reserved; 0 is used to mean  
no
 interrupt and 2 is used for IPIs.  Internally these are  
represented in
 blocks of 1024, called ICS (interrupt controller source) entities,  
but

 that is not visible to userspace.
 
 Two other new ioctls allow userspace to control the interrupt
 sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
 destination cpu, level/edge sensitivity and pending state of a  
range
 of interrupt sources, creating them if they don't already exist.   
The
 KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range  
of

 interrupt sources (they are required to already exist).

 Why is it userspace's job to control these?  If you use  
KVM_IRQ_PENDING


These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.


In that case, the state it describes is insufficient for MPIC.


We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration,


We don't yet, but would prefer not to assume that it'll never happen.

 for interrupt injection, what if there's a race with the user  
changing
 other flags via MMIO?  Maybe this isn't an issue with XICS, but  
this is

 being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.


OK, KVM_IRQ_LINE is still used for interrupt injection.  I was hoping  
to avoid going through a standardized interface that forces a global  
interrupt numberspace.


How do MSIs get injected?

BTW, do you have any plans regarding irqfd?


 +struct kvm_irq_sources {
 +  __u32 irq;
 +  __u32 nr_irqs;
 +  __u64 __user *irqbuf;
 +};

 Please no pointers in UAPI -- this would require a compat wrapper  
with

 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,


It doesn't need to be a fixed size buffer.  You can still have  
pointers, but they need to be represented as a plain __u64 with users  
casting the pointer.



 +/* irqbuf entries are laid out like this: */
 +#define KVM_IRQ_SERVER_SHIFT  0
 +#define KVM_IRQ_SERVER_MASK   0xULL
 +#define KVM_IRQ_PRIORITY_SHIFT32
 +#define KVM_IRQ_PRIORITY_MASK 0xff
 +#define KVM_IRQ_LEVEL_SENSITIVE   (1ULL  40)
 +#define KVM_IRQ_MASKED(1ULL  41)
 +#define KVM_IRQ_PENDING   (1ULL  42)

 What does server mean?  Do you mean laid out like this for XICS?

Sorry, I should have made that 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
 On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
 On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
  On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
  From: Benjamin Herrenschmidt b...@kernel.crashing.org
  
  This adds in-kernel emulation of the XICS (eXternal Interrupt
  Controller Specification) interrupt controller specified by
 PAPR, for
  both HV and PR KVM guests.
  
  This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
  KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
  should use in-kernel interrupt controller emulation, but also
 takes an
  argument struct that contains the type of interrupt controller
  architecture and an optional parameter.  Currently only one
 type value
  is defined, that which indicates the XICS architecture.
 
  Would the device config API I posted a couple days ago work for you?
 
 I suppose it could be made to work.  It doesn't feel like a natural
 fit though, because your API seems to assume (AFAICT) that a device is
 manipulated via registers at specific physical addresses, so I would
 have to invent an artificial set of registers with addresses and bit
 layouts, that aren't otherwise required.  The XICS is operated from
 the guest side via hcalls, not via emulated MMIO.
 
 I don't think it makes such an assumption.  The MPIC device has
 physical registers, so it exposes them, but it also exposes things
 that are not physical registers (e.g. the per-IRQ input state).  The
 generic device control layer leaves interpretation of attributes up
 to the device.
 
 I think it would be easier to fit XICS into the device control api
 model than to fit MPIC into this model, not to mention what would
 happen if we later want to emulate some other type of device -- x86
 already has at least one non-irqchip emulated device (i8254).

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.  Further, the magic means that you can
only have one instance of the device, whereas you might want to model
the interrupt controller architecture as several devices.  You could
do that using several device types, but then the interconnections
between them would also be magic.

Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.

 Part of the reason I went with the API that I did is that that was
 what was agreed on at KVM Forum (as far as I can tell, not having been
 at the meeting).  Your device API seems to be quite different to that.
 
 I wasn't there either.  It's fine to have discussions at such
 events, but it should not preclude others from having input, or
 better ideas from being considered afterward.

Sure - I was just trying to fit in with the expressed wish of the
maintainer.

  The XICS emulation supports up to 1048560 interrupt sources.
  Interrupt source numbers below 16 are reserved; 0 is used to
 mean no
  interrupt and 2 is used for IPIs.  Internally these are
 represented in
  blocks of 1024, called ICS (interrupt controller source)
 entities, but
  that is not visible to userspace.
  
  Two other new ioctls allow userspace to control the interrupt
  sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
  destination cpu, level/edge sensitivity and pending state of a
 range
  of interrupt sources, creating them if they don't already
 exist.  The
  KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a
 range of
  interrupt sources (they are required to already exist).
 
  Why is it userspace's job to control these?  If you use
 KVM_IRQ_PENDING
 
 These are primarily there to support live migration.  For live
 migration, userspace needs to be able to extract the entire state of
 the virtual machine from the old guest, and then set the new guest to
 that exact same state.
 
 In that case, the state it describes is insufficient for MPIC.

Yes, MPIC has other random stuff in it besides interrupt control, so
that's not surprising.

 We have live migration working in qemu for
 pSeries guests with in-kernel XICS emulation using this interface.
 If you're not doing live migration,
 
 We don't yet, but would prefer not to assume that it'll never happen.
 
  for interrupt injection, what if there's a race with the user
 changing
  other flags via MMIO?  Maybe this isn't an issue with XICS, but
 this is
  being presented as a generic API.
 
 They're not used while the guest is running, as I said, but even if
 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.


OK.  This is device-specific behavior, so you could define it  
differently for XICS than MPIC.  I suppose we could change it for MPIC  
as well, to leave an opening for the unlikely case where we'd want to  
model an MPIC that isn't directly connected to the CPUs.


How is the explicit request made in this patchset?


Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.


Yes.  I am unconvinced that such an abstraction is well-advised  
(especially after seeing existing generic interfaces that are clearly  
APIC-oriented).  This isn't like normal driver interfaces where we're  
abstracting away hardware differences to let generic code use a  
device.  Userspace knows what kind of device it wants, and how it wants  
it to integrate with the rest of the emulated system.  We'd have to go  
out of our way to apply the abstraction on *both* ends.  What do we get  
from that other than a chance that the abstraction leaks?  What  
significant code actually becomes common?  kvm_set_irq() is just a thin  
wrapper around the ioctl.



 We have live migration working in qemu for
 pSeries guests with in-kernel XICS emulation using this interface.
 If you're not doing live migration,

 We don't yet, but would prefer not to assume that it'll never  
happen.


  for interrupt injection, what if there's a race with the user
 changing
  other flags via MMIO?  Maybe this isn't an issue with XICS, but
 this is
  being presented as a generic API.
 
 They're not used while the guest is running, as I said, but even if
 they were, there is appropriate locking in there to handle any  
races.


 OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
 hoping to avoid going through a standardized interface that forces a
 global interrupt numberspace.

Why?


The standardized interface doesn't make things any easier (as noted  
above, the caller is already mpic-specific code), and we'd have to come  
up with a scheme for flattening our interrupt numberspace (rather than  
introduce new attribute groups for things like IPI and timer  
interrupts).  It may still be necessary when it comes to irqfd,  
though...



 How do MSIs get injected?

Just like other interrupts - from the point of view of the interrupt
controller they're edge-triggered interrupt sources.


Ah right, I guess this is all set up via hcalls for XICS.

With MPIC exposing its registers via the device control api, everything  
just works -- the PCI device generates a write to the MPIC's memory  
region, the QEMU MPIC stub sends the write to the kernel as for any  
other MMIO access (this passthrough is also useful for debugging), the  
in-kernel MPIC sees the write to the generate an MSI register and  
does its thing.  Compare that to all special the MSI code for APIC...  
:-)



 BTW, do you have any plans regarding irqfd?

I'm going to look at that next.


Likewise...  We should probably coordinate our efforts so that at least  
the de-APICization of the code only has to get done once.



 What about interrupt controllers that allow multiple destinations?

The destination can be an identifier for a group of vcpus, or even a
bitmap -- that's why I made it 32 bits.


So you can have single delivery, or be limited to 32 vcpus, or have to  
implement some destination ID allocation scheme (which is more state  
that needs to be accessible somehow).



 More than 256 priorities?  Different levels of output (normal,
 critical, machine check)?  Programmable vector numbers?  Active
 high/low control?

There are plenty of bits free in the 64 bits per source that I have
allowed.  We can accommodate those things.


MPIC vector numbers take up 16 of the bits.  The architected interrupt  
level field is 8 bits, though only a handful of values are actually  
needed.  Add a couple binary flags, and it gets pretty tight if a third  
type of interrupt controller starts wanting something new.



 (BTW, I think having more
than 256 priorities would be insane - do you know of any actual
example that does?)


No, but hardware designers have been known to do insane things.


 The per-vcpu state isn't even part of this AFAICT.  It's an
 XICS-specific ONE_REG -- which is fine, but all that's left of the
 generic API is the get/set sources which is an imperfect match to
 our per-IRQ state and it's not clear how an 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
 On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
 I have no particular objection to the device control API per se, but
 I have two objections to using it as the primary interface to the XICS
 emulation.
 
 First, I dislike the magical side-effect where creating a device of a
 particular type (e.g. MPIC or XICS) automatically attaches it to the
 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control.
 
 OK.  This is device-specific behavior, so you could define it
 differently for XICS than MPIC.  I suppose we could change it for
 MPIC as well, to leave an opening for the unlikely case where we'd
 want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.

 How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic create device ioctl that could be for any device.

 Secondly, it means that we are completely abandoning any attempt to
 define an abstract or generic interface to in-kernel interrupt
 controller emulations.  Each device will have its own unique set of
 attribute groups and its own unique userspace code to drive it, with
 no commonality between them.
 
 Yes.  I am unconvinced that such an abstraction is well-advised
 (especially after seeing existing generic interfaces that are
 clearly APIC-oriented).  This isn't like normal driver interfaces
 where we're abstracting away hardware differences to let generic
 code use a device.  Userspace knows what kind of device it wants,
 and how it wants it to integrate with the rest of the emulated
 system.  We'd have to go out of our way to apply the abstraction on
 *both* ends.  What do we get from that other than a chance that the
 abstraction leaks?  What significant code actually becomes common?
 kvm_set_irq() is just a thin wrapper around the ioctl.

I'd think there could be some code reduction in the live migration
code, but I'd need a qemu hacker to chime in here.

  OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
  hoping to avoid going through a standardized interface that forces a
  global interrupt numberspace.
 
 Why?
 
 The standardized interface doesn't make things any easier (as noted
 above, the caller is already mpic-specific code), and we'd have to
 come up with a scheme for flattening our interrupt numberspace
 (rather than introduce new attribute groups for things like IPI and
 timer interrupts).  It may still be necessary when it comes to
 irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.

  How do MSIs get injected?
 
 Just like other interrupts - from the point of view of the interrupt
 controller they're edge-triggered interrupt sources.
 
 Ah right, I guess this is all set up via hcalls for XICS.
 
 With MPIC exposing its registers via the device control api,
 everything just works -- the PCI device generates a write to the
 MPIC's memory region, the QEMU MPIC stub sends the write to the
 kernel as for any other MMIO access (this passthrough is also useful
 for debugging), the in-kernel MPIC sees the write to the generate
 an MSI register and does its thing.  Compare that to all special
 the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?  If that's so, then why bother with
in-kernel emulation at all?  Once you're back in userspace, it's just
as fast to do the emulation there as in the kernel.

 There are plenty of bits free in the 64 bits per source that I have
 allowed.  We can accommodate those things.
 
 MPIC vector numbers take up 16 of the bits.  The architected
 interrupt level field is 8 bits, though only a handful of values are
 actually needed.  Add a couple binary flags, and it gets pretty
 tight if a third type of interrupt controller starts wanting
 something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.

 Yes, the names of the bitfields in the ICP state word are
 XICS-specific, but the concepts are pretty generic - current processor
 priority, identifier for interrupt awaiting service, pending IPI
 request priority, pending interrupt request priority.
 
 We don't have separate concepts of pending IPI request priority
 and pending interrupt request priority.  There can be multiple

Sorry, I meant pending interrupt request.  You do have that, it's
what you read from the interrupt acknowledge register.

 interrupts awaiting service (or even in service, if different
 priorities).  We have both current task priority (which is a
 user-set 

[PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-14 Thread Paul Mackerras
From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
should use in-kernel interrupt controller emulation, but also takes an
argument struct that contains the type of interrupt controller
architecture and an optional parameter.  Currently only one type value
is defined, that which indicates the XICS architecture.

The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Two other new ioctls allow userspace to control the interrupt
sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
destination cpu, level/edge sensitivity and pending state of a range
of interrupt sources, creating them if they don't already exist.  The
KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
interrupt sources (they are required to already exist).

Each vcpu gets one ICP (interrupt controller presentation) entity.
They are created automatically when the vcpu is created provided the
KVM_CREATE_IRQCHIP_ARGS ioctl has been performed.

This is based on an initial implementation by Michael Ellerman
mich...@ellerman.id.au reworked by Benjamin Herrenschmidt and
Paul Mackerras.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt |   51 ++
 arch/powerpc/include/asm/kvm_book3s.h |1 +
 arch/powerpc/include/asm/kvm_host.h   |8 +
 arch/powerpc/include/asm/kvm_ppc.h|   19 +
 arch/powerpc/kvm/Makefile |1 +
 arch/powerpc/kvm/book3s.c |2 +-
 arch/powerpc/kvm/book3s_hv.c  |   20 +
 arch/powerpc/kvm/book3s_pr.c  |   13 +
 arch/powerpc/kvm/book3s_pr_papr.c |   16 +
 arch/powerpc/kvm/book3s_rtas.c|   51 +-
 arch/powerpc/kvm/book3s_xics.c| 1101 +
 arch/powerpc/kvm/book3s_xics.h|  111 
 arch/powerpc/kvm/powerpc.c|   23 +
 include/uapi/linux/kvm.h  |   29 +
 14 files changed, 1444 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xics.c
 create mode 100644 arch/powerpc/kvm/book3s_xics.h

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index d3e2d60..0ff9dcf 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2141,6 +2141,57 @@ associated with the service will be forgotten, and 
subsequent RTAS
 calls by the guest for that service will be passed to userspace to be
 handled.
 
+4.80 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irqchip_args
+Returns: 0 on success, -1 on error
+
+Creates an interrupt controller model in the kernel.  The type field
+of the argument struct indicates the interrupt controller architecture
+of the virtual machine.  Currently the only value permitted for the
+type field is 1, indicating the XICS (eXternal Interrupt Controller
+Specification) model defined in PAPR.  For XICS, this ioctl indicates
+to the kernel that an interrupt controller presentation (ICP) entity
+should be created for every vcpu, and interrupt controller source
+(ICS) entities should be created to accommodate the sources that are
+configured with the KVM_IRQCHIP_SET_SOURCES ioctl.
+
+4.81 KVM_IRQCHIP_GET_SOURCES
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irq_sources
+Returns: 0 on success, -1 on error
+
+Copies configuration and status information about a range of interrupt
+sources into a user-supplied buffer.  The argument struct gives the
+starting interrupt source number and the number of interrupt sources.
+The user buffer is an array of 64-bit quantities, one per interrupt
+source, with (from the least- significant bit) 32 bits of interrupt
+server number, 8 bits of priority, and 1 bit each for a
+level-sensitive indicator, a masked indicator, and a pending
+indicator.  If some of the sources in the range don't exist, that is,
+have not yet been created with the KVM_IRQCHIP_SET_SOURCES ioctl,
+this returns an ENODEV error.
+
+4.82 KVM_IRQCHIP_SET_SOURCES
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irq_sources
+Returns: 0 on success, -1 on error
+
+Sets the configuration and status for a range of interrupt sources
+from information supplied in a user-supplied buffer, creating the
+sources if they don't already exist.  The argument struct gives the
+starting interrupt source number 

[PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-14 Thread Paul Mackerras
From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
should use in-kernel interrupt controller emulation, but also takes an
argument struct that contains the type of interrupt controller
architecture and an optional parameter.  Currently only one type value
is defined, that which indicates the XICS architecture.

The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Two other new ioctls allow userspace to control the interrupt
sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
destination cpu, level/edge sensitivity and pending state of a range
of interrupt sources, creating them if they don't already exist.  The
KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
interrupt sources (they are required to already exist).

Each vcpu gets one ICP (interrupt controller presentation) entity.
They are created automatically when the vcpu is created provided the
KVM_CREATE_IRQCHIP_ARGS ioctl has been performed.

This is based on an initial implementation by Michael Ellerman
mich...@ellerman.id.au reworked by Benjamin Herrenschmidt and
Paul Mackerras.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt |   51 ++
 arch/powerpc/include/asm/kvm_book3s.h |1 +
 arch/powerpc/include/asm/kvm_host.h   |8 +
 arch/powerpc/include/asm/kvm_ppc.h|   19 +
 arch/powerpc/kvm/Makefile |1 +
 arch/powerpc/kvm/book3s.c |2 +-
 arch/powerpc/kvm/book3s_hv.c  |   20 +
 arch/powerpc/kvm/book3s_pr.c  |   13 +
 arch/powerpc/kvm/book3s_pr_papr.c |   16 +
 arch/powerpc/kvm/book3s_rtas.c|   51 +-
 arch/powerpc/kvm/book3s_xics.c| 1101 +
 arch/powerpc/kvm/book3s_xics.h|  111 
 arch/powerpc/kvm/powerpc.c|   23 +
 include/uapi/linux/kvm.h  |   29 +
 14 files changed, 1444 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xics.c
 create mode 100644 arch/powerpc/kvm/book3s_xics.h

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index d3e2d60..0ff9dcf 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2141,6 +2141,57 @@ associated with the service will be forgotten, and 
subsequent RTAS
 calls by the guest for that service will be passed to userspace to be
 handled.
 
+4.80 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irqchip_args
+Returns: 0 on success, -1 on error
+
+Creates an interrupt controller model in the kernel.  The type field
+of the argument struct indicates the interrupt controller architecture
+of the virtual machine.  Currently the only value permitted for the
+type field is 1, indicating the XICS (eXternal Interrupt Controller
+Specification) model defined in PAPR.  For XICS, this ioctl indicates
+to the kernel that an interrupt controller presentation (ICP) entity
+should be created for every vcpu, and interrupt controller source
+(ICS) entities should be created to accommodate the sources that are
+configured with the KVM_IRQCHIP_SET_SOURCES ioctl.
+
+4.81 KVM_IRQCHIP_GET_SOURCES
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irq_sources
+Returns: 0 on success, -1 on error
+
+Copies configuration and status information about a range of interrupt
+sources into a user-supplied buffer.  The argument struct gives the
+starting interrupt source number and the number of interrupt sources.
+The user buffer is an array of 64-bit quantities, one per interrupt
+source, with (from the least- significant bit) 32 bits of interrupt
+server number, 8 bits of priority, and 1 bit each for a
+level-sensitive indicator, a masked indicator, and a pending
+indicator.  If some of the sources in the range don't exist, that is,
+have not yet been created with the KVM_IRQCHIP_SET_SOURCES ioctl,
+this returns an ENODEV error.
+
+4.82 KVM_IRQCHIP_SET_SOURCES
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irq_sources
+Returns: 0 on success, -1 on error
+
+Sets the configuration and status for a range of interrupt sources
+from information supplied in a user-supplied buffer, creating the
+sources if they don't already exist.  The argument struct gives the
+starting interrupt source number 

[RFC PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2012-11-04 Thread Paul Mackerras
From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

It supports up to 4095 BUIDs (blocks of interrupts) of up to 4096
interrupts each.  Each vcpu gets one ICP (interrupt controller
presentation) entity, and each BUID gets one ICS (interrupt controller
source) entity.  These entities are created with the new
KVM_CREATE_IRQCHIP_ARGS ioctl, which takes a structure indicating
which type of controller to create and the parameters pertaining to it
(flags for both, BUID and number of interrupt sources for ICS).
We don't use the existing KVM_CREATE_IRQCHIP ioctl because it doesn't
allow us to pass any parameters defining the new interrupt controller.

This is based on an initial implementation by Michael Ellerman
mich...@ellerman.id.au reworked by myself.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt |   17 +
 arch/powerpc/include/asm/kvm_book3s.h |1 +
 arch/powerpc/include/asm/kvm_host.h   |8 +
 arch/powerpc/include/asm/kvm_ppc.h|   18 +
 arch/powerpc/include/uapi/asm/kvm.h   |   32 +
 arch/powerpc/kvm/Makefile |1 +
 arch/powerpc/kvm/book3s.c |2 +-
 arch/powerpc/kvm/book3s_hv.c  |   20 +
 arch/powerpc/kvm/book3s_pr.c  |   13 +
 arch/powerpc/kvm/book3s_pr_papr.c |   16 +
 arch/powerpc/kvm/book3s_rtas.c|   51 +-
 arch/powerpc/kvm/book3s_xics.c| 1112 +
 arch/powerpc/kvm/powerpc.c|   21 +
 include/uapi/linux/kvm.h  |6 +-
 14 files changed, 1315 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xics.c

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 8ca0a7c..fbe018e 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2090,6 +2090,23 @@ associated with the service will be forgotten, and 
subsequent RTAS
 calls by the guest for that service will be passed to userspace to be
 handled.
 
+4.79 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_SPAPR_XICS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irqchip_args
+Returns: 0 on success, -1 on error
+
+Creates an interrupt controller model in the kernel.  This is
+currently only implemented for the XICS (eXternal Interrupt Controller
+Specification) model defined in PAPR.  This ioctl creates either an
+interrupt controller presentation (ICP) or an interrupt controller
+source (ICS) entity, depending on the type field of the argument
+struct.  When creating an ICS, the argument struct further indicates
+the BUID (Bus Unit ID) and number of interrupt sources for the new
+ICS.
+
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 36fcf41..fbda9b1 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -140,6 +140,7 @@ extern int kvmppc_mmu_hv_init(void);
 extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr, 
bool data);
 extern int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr, 
bool data);
 extern void kvmppc_book3s_queue_irqprio(struct kvm_vcpu *vcpu, unsigned int 
vec);
+extern void kvmppc_book3s_dequeue_irqprio(struct kvm_vcpu *vcpu, unsigned int 
vec);
 extern void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, u64 flags);
 extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
   bool upper, u32 val);
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 8e745a8..0136e1e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -190,6 +190,10 @@ struct kvmppc_linear_info {
int  type;
 };
 
+/* XICS components, defined in boo3s_xics.c */
+struct kvmppc_xics;
+struct kvmppc_icp;
+
 /*
  * The reverse mapping array has one entry for each HPTE,
  * which stores the guest's view of the second word of the HPTE
@@ -256,6 +260,7 @@ struct kvm_arch {
 #ifdef CONFIG_PPC_BOOK3S_64
struct list_head spapr_tce_tables;
struct list_head rtas_tokens;
+   struct kvmppc_xics *xics;
 #endif
 };
 
@@ -565,6 +570,9 @@ struct kvm_vcpu_arch {
u64 busy_stolen;
u64 busy_preempt;
 #endif
+#ifdef CONFIG_PPC_BOOK3S_64
+   struct kvmppc_icp *icp; /* XICS presentation controller */
+#endif
 };
 
 /* Values for vcpu-arch.state */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 996aeca..c74fd20 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -131,6 +131,12 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void 

[RFC PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2012-11-04 Thread Paul Mackerras
From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

It supports up to 4095 BUIDs (blocks of interrupts) of up to 4096
interrupts each.  Each vcpu gets one ICP (interrupt controller
presentation) entity, and each BUID gets one ICS (interrupt controller
source) entity.  These entities are created with the new
KVM_CREATE_IRQCHIP_ARGS ioctl, which takes a structure indicating
which type of controller to create and the parameters pertaining to it
(flags for both, BUID and number of interrupt sources for ICS).
We don't use the existing KVM_CREATE_IRQCHIP ioctl because it doesn't
allow us to pass any parameters defining the new interrupt controller.

This is based on an initial implementation by Michael Ellerman
mich...@ellerman.id.au reworked by myself.

Signed-off-by: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Paul Mackerras pau...@samba.org
---
 Documentation/virtual/kvm/api.txt |   17 +
 arch/powerpc/include/asm/kvm_book3s.h |1 +
 arch/powerpc/include/asm/kvm_host.h   |8 +
 arch/powerpc/include/asm/kvm_ppc.h|   18 +
 arch/powerpc/include/uapi/asm/kvm.h   |   32 +
 arch/powerpc/kvm/Makefile |1 +
 arch/powerpc/kvm/book3s.c |2 +-
 arch/powerpc/kvm/book3s_hv.c  |   20 +
 arch/powerpc/kvm/book3s_pr.c  |   13 +
 arch/powerpc/kvm/book3s_pr_papr.c |   16 +
 arch/powerpc/kvm/book3s_rtas.c|   51 +-
 arch/powerpc/kvm/book3s_xics.c| 1112 +
 arch/powerpc/kvm/powerpc.c|   21 +
 include/uapi/linux/kvm.h  |6 +-
 14 files changed, 1315 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xics.c

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 8ca0a7c..fbe018e 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2090,6 +2090,23 @@ associated with the service will be forgotten, and 
subsequent RTAS
 calls by the guest for that service will be passed to userspace to be
 handled.
 
+4.79 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_SPAPR_XICS
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_irqchip_args
+Returns: 0 on success, -1 on error
+
+Creates an interrupt controller model in the kernel.  This is
+currently only implemented for the XICS (eXternal Interrupt Controller
+Specification) model defined in PAPR.  This ioctl creates either an
+interrupt controller presentation (ICP) or an interrupt controller
+source (ICS) entity, depending on the type field of the argument
+struct.  When creating an ICS, the argument struct further indicates
+the BUID (Bus Unit ID) and number of interrupt sources for the new
+ICS.
+
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 36fcf41..fbda9b1 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -140,6 +140,7 @@ extern int kvmppc_mmu_hv_init(void);
 extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr, 
bool data);
 extern int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr, 
bool data);
 extern void kvmppc_book3s_queue_irqprio(struct kvm_vcpu *vcpu, unsigned int 
vec);
+extern void kvmppc_book3s_dequeue_irqprio(struct kvm_vcpu *vcpu, unsigned int 
vec);
 extern void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, u64 flags);
 extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat *bat,
   bool upper, u32 val);
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 8e745a8..0136e1e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -190,6 +190,10 @@ struct kvmppc_linear_info {
int  type;
 };
 
+/* XICS components, defined in boo3s_xics.c */
+struct kvmppc_xics;
+struct kvmppc_icp;
+
 /*
  * The reverse mapping array has one entry for each HPTE,
  * which stores the guest's view of the second word of the HPTE
@@ -256,6 +260,7 @@ struct kvm_arch {
 #ifdef CONFIG_PPC_BOOK3S_64
struct list_head spapr_tce_tables;
struct list_head rtas_tokens;
+   struct kvmppc_xics *xics;
 #endif
 };
 
@@ -565,6 +570,9 @@ struct kvm_vcpu_arch {
u64 busy_stolen;
u64 busy_preempt;
 #endif
+#ifdef CONFIG_PPC_BOOK3S_64
+   struct kvmppc_icp *icp; /* XICS presentation controller */
+#endif
 };
 
 /* Values for vcpu-arch.state */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 996aeca..c74fd20 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -131,6 +131,12 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void