On 05/04/2018 08:35 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 10:43:47AM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 04:29 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 07:36 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
>>>>>> On 04/16/2018 06:26 AM, David Gibson wrote:
>>>>>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>>>>>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>>>>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>>>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
>>>>> [snip]
>>>>>>>> The XIVE tables are :
>>>>>>>>
>>>>>>>> * IVT
>>>>>>>>
>>>>>>>>   associates an interrupt source number with an event queue. The
>>>>>>>>   data to be pushed in the queue is also stored there.
>>>>>>>
>>>>>>> Ok, so there would be one of these tables for each IVRE,
>>>>>>
>>>>>> yes. one for each XIVE interrupt controller. That is one per
>>>>>> processor or socket.
>>>>>
>>>>> Ah.. so there can be more than one in a multi-socket system.
>>>>>
>>>>>>> with one entry for each source managed by that IVSE, yes?
>>>>>>
>>>>>> yes. The table is simply indexed by the interrupt number in the
>>>>>> global IRQ number space of the machine.
>>>>>
>>>>> How does that work on a multi-chip machine?  Does each chip just
>>>>> have a table for a slice of the global irq number space?
>>>>
>>>> yes. IRQ allocation is done relative to the chip, each chip having
>>>> a range depending on its block id. XIVE has a concept of block,
>>>> which is used in skiboot in a one-to-one relationship with the
>>>> chip.
>>>
>>> Ok.  I'm assuming this block id forms the high(ish) bits of the
>>> global irq number, yes?
>>
>> yes. the 8 top bits are reserved, the next 4 bits are for the
>> block id, 16 blocks for 16 sockets/chips, and the 20 lower bits
>> are for the ISN.
>
> Ok.
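
To make the layout above concrete, here is a minimal sketch (the
macro and helper names are mine, for illustration only; they are
not the actual skiboot or QEMU symbols) :

#include <stdint.h>

/*
 * Global IRQ number layout :
 *
 *   |31 ... 24|  8 top bits, reserved
 *   |23 ... 20|  4 bits of block id (16 blocks, one per socket/chip)
 *   |19 ...  0|  20 bits of interrupt source number (ISN)
 */
#define GIRQ_BLOCK_SHIFT  20
#define GIRQ_BLOCK_MASK   0xf
#define GIRQ_ISN_MASK     0xfffff

static inline uint32_t girq_to_block(uint32_t girq)
{
    return (girq >> GIRQ_BLOCK_SHIFT) & GIRQ_BLOCK_MASK;
}

static inline uint32_t girq_to_isn(uint32_t girq)
{
    return girq & GIRQ_ISN_MASK;
}

static inline uint32_t girq_from(uint32_t block, uint32_t isn)
{
    return ((block & GIRQ_BLOCK_MASK) << GIRQ_BLOCK_SHIFT) |
           (isn & GIRQ_ISN_MASK);
}

Each chip's IVT then only needs to cover its own 20-bit ISN slice of
the global IRQ number space.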
>
>>>>>>> Do the XIVE IPIs have entries here, or do they bypass this?
>>>>>>
>>>>>> no, they don't bypass it. The IPIs also have entries in this
>>>>>> table.
>>>>>>
>>>>>>>> * EQDT:
>>>>>>>>
>>>>>>>>   describes the queues in the OS RAM, also contains a set of
>>>>>>>>   flags, a virtual target, etc.
>>>>>>>
>>>>>>> So on real hardware this would be global, yes?  And it would be
>>>>>>> consulted by the IVRE?
>>>>>>
>>>>>> yes. Exactly. The XIVE routing routine :
>>>>>>
>>>>>>   https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
>>>>>>
>>>>>> gives a good overview of the usage of the tables.
>>>>>>
>>>>>>> For guests, we'd expect one table per-guest?
>>>>>>
>>>>>> yes but only in emulation mode.
>>>>>
>>>>> I'm not sure what you mean by this.
>>>>
>>>> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the
>>>> overall table allocated in OPAL for the system.
>>>
>>> Right.. I'm thinking of this from the point of view of the guest
>>> and/or qemu, rather than from the implementation.  Even if the
>>> actual storage of the entries is distributed across the host's
>>> global table, we still logically have a table per guest, right?
>>
>> Yes. (the XiveSource object would be the table-per-guest and its
>> counterpart in KVM: the source block)
>
> Uh.. I'm talking about the IVT (or a slice of it) here, so this would
> be a XiveRouter, not a XiveSource owning it.

yes. Sorry. sPAPR has a unique XiveSource and a corresponding IVT.

>>>>>>> How would those be integrated with the host table?
>>>>>>
>>>>>> Under KVM, this is handled by the host table (setup done in
>>>>>> skiboot) and we are only interested in the state of the EQs for
>>>>>> migration.
>>>>>
>>>>> This doesn't make sense to me; the guest is able to alter the IVT
>>>>> entries, so that configuration must be migrated somehow.
>>>>
>>>> yes. The IVE needs to be migrated. We use get/set KVM ioctls to
>>>> save and restore the value which is cached in the KVM irq state
>>>> struct (server, prio, eq data). no OPAL calls are needed though.
>>>
>>> Right.  Again, at this stage I don't particularly care what the
>>> backend details are - whether the host calls OPAL or whatever.  I'm
>>> more concerned with the logical model.
>>
>> ok.
>>
>>>>>> This state is set with the H_INT_SET_QUEUE_CONFIG hcall,
>>>>>
>>>>> "This state" here meaning IVT entries?
>>>>
>>>> no. The H_INT_SET_QUEUE_CONFIG hcall sets the event queue OS page
>>>> for a server/priority couple. That is where the event queue data
>>>> is pushed.
>>>
>>> Ah.  Doesn't that mean the guest *does* effectively have an EQD
>>> table,
>>
>> well, yes, it's under the hood. but the guest does not know anything
>> about the XIVE controller internal structures, IVE, EQD, VPD and
>> tables. Only OPAL does in fact.
>
> Right, it's under the hood.  But then so is the IVT (and the TCE
> tables and the HPT for that matter).  So we're probably going to have
> a (*get_eqd) method somewhere that looks up in guest RAM or in an
> external table depending.

yes. definitely.
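
Something like the following could work. This is only a sketch of the
idea: the signature, the XIVE_PRIORITY_MAX bound and the per-priority
'eqt[]' array under the XiveNVT are assumptions for the example, not
code from the patchset. It uses the get_nvt method of the XiveFabric
interface quoted further down :

/* A possible (*get_eqd) method for the XiveFabric interface :
 * look up the EQ descriptor for a server/priority couple. */
XiveEQ *(*get_eqd)(XiveFabric *xf, uint32_t server, uint8_t prio);

/* sPAPR emulation mode : the EQs live under the XiveNVT object of
 * the vcpu, one per priority, so the lookup is a simple array
 * access. */
static XiveEQ *spapr_xive_get_eqd(XiveFabric *xf, uint32_t server,
                                  uint8_t prio)
{
    XiveNVT *nvt = XIVE_FABRIC_GET_CLASS(xf)->get_nvt(xf, server);

    if (!nvt || prio >= XIVE_PRIORITY_MAX) {
        return NULL;
    }
    return &nvt->eqt[prio];
}

/* Under KVM, the equivalent method would instead fetch the EQ state
 * from the host table through the get/set ioctls, some fields
 * requiring OPAL support. */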

C.

>>> updated by this call?
>>
>> it is indeed the purpose of H_INT_SET_QUEUE_CONFIG
>>
>>> We'd need to migrate that data as well,
>>
>> yes we do and some fields require OPAL support.
>>
>>> and it's not part of the IVT, right?
>>
>> yes. The IVT only contains the EQ index and the server/priority
>> tuple used for routing.
>>
>>>> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server,
>>>> priority, and the eq data to be pushed in case of an event.
>>>
>>> Ok - that's the IVT entries, yes?
>>
>> yes.
>>
>>>>>> followed by an OPAL call and then a HW update. It defines the EQ
>>>>>> page in which to push event notifications for the
>>>>>> server/priority couple.
>>>>>>
>>>>>>>> * VPDT:
>>>>>>>>
>>>>>>>>   describes the virtual targets, which can have different
>>>>>>>>   natures : an lpar, a cpu. This is for powernv; spapr does
>>>>>>>>   not have this concept.
>>>>>>>
>>>>>>> Ok. On hardware that would also be global and consulted by the
>>>>>>> IVRE, yes?
>>>>>>
>>>>>> yes.
>>>>>
>>>>> Except.. is it actually global, or is there one per-chip/socket?
>>>>
>>>> There is a global VP allocator splitting the ids depending on the
>>>> block/chip, but, to be honest, I have not dug into the details.
>>>>
>>>>> [snip]
>>>>>>>> In the current version I am working on, the XiveFabric
>>>>>>>> interface is more complex :
>>>>>>>>
>>>>>>>>   typedef struct XiveFabricClass {
>>>>>>>>       InterfaceClass parent;
>>>>>>>>       XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>>>>>>
>>>>>>> This does an IVT lookup, I take it?
>>>>>>
>>>>>> yes. It is an interface for the underlying storage, which is
>>>>>> different in sPAPR and PowerNV. The goal is to make the routing
>>>>>> generic.
>>>>>
>>>>> Right.  So, yes, we definitely want a method *somewhere* to do an
>>>>> IVT lookup.  I'm not entirely sure where it belongs yet.
>>>>
>>>> Me either. I have stuffed the XiveFabric with all the abstraction
>>>> needed for the moment.
>>>>
>>>> I am starting to think that there should be an interface to
>>>> forward events and another one to route them, the router being a
>>>> special case of the forwarder, the last one in the chain. The
>>>> "simple" devices, like PSI, should only be forwarders for the
>>>> sources they own but the interrupt controllers should be
>>>> forwarders (they have sources) and also routers.
>>>
>>> I'm not really clear what you mean by "forward" here.
>>
>> When an interrupt source is triggered, a notification event can
>> be generated and forwarded to the XIVE router if the transition
>> algo (depending on the PQ bits) lets it through. A forward is
>> a simple load of the IRQ number at a specific MMIO address defined
>> by the main IC.
>>
>> For QEMU sPAPR, it's a function call but for QEMU powernv, it's a
>> load.
>>
>> C.
>>
>>>>>>>>       XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>>>>>>>
>>>>>>> This one a VPDT lookup, yes?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>>       XiveEQ *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>>>>>>>
>>>>>>> And this one an EQDT lookup?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>>   } XiveFabricClass;
>>>>>>>>
>>>>>>>> It helps in making the routing algorithm independent of the
>>>>>>>> model. I hope to make powernv converge and use it.
>>>>>>>>
>>>>>>>> - a set of MMIOs for the TIMA. They model the presenter
>>>>>>>>   engine. current_cpu is used to retrieve the NVT object,
>>>>>>>>   which holds the registers for interrupt management.
>>>>>>>
>>>>>>> Right.  Now the TIMA is local to a target/server not an EQ,
>>>>>>> right?
>>>>>>
>>>>>> The TIMA is the MMIO giving access to the registers which are
>>>>>> per CPU. The EQs are for routing. They are under the CPU object
>>>>>> because it is convenient.
>>>>>>
>>>>>>> I guess we need at least one of these per-vcpu.
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>> Do we also need an lpar-global, or other special ones?
>>>>>>
>>>>>> That would be for the host. AFAICT KVM does not use such special
>>>>>> VPs.
>>>>>
>>>>> Um.. "does not use".. don't we get to decide that?
>>>>
>>>> Well, that part of the specs is still a little obscure to me and
>>>> I am not sure it will fit very well in the Linux/KVM model. It
>>>> should be hidden from the guest anyway and can come in later.
>>>>
>>>>>>>> The EQs are stored under the NVT. This saves us an unnecessary
>>>>>>>> EQDT table. But we could add one under the XIVE device model.
>>>>>>>
>>>>>>> I'm not sure of the distinction you're drawing between the NVT
>>>>>>> and the XIVE device model.
>>>>>>
>>>>>> we could add a new table under the XIVE interrupt device model
>>>>>> sPAPRXive to store the EQs and index them like skiboot does.
>>>>>> But it seems unnecessary to me as we can use the object below
>>>>>> 'cpu->intc', which is the XiveNVT object.
>>>>>
>>>>> So, basically assuming a fixed set of EQs (one per priority?)
>>>>
>>>> yes. It's easier to capture the state and dump information from
>>>> the monitor.
>>>>
>>>>> per CPU for a PAPR guest?
>>>>
>>>> yes, that's how it works.
>>>>
>>>>> That makes sense (assuming PAPR doesn't provide guest interfaces
>>>>> to ask for something else).
>>>>
>>>> Yes. All hcalls take prio/server parameters and the reserved prio
>>>> range for the platform is in the device tree. 0xFF is a special
>>>> case to reset targeting.
>>>>
>>>> Thanks,
>>>>
>>>> C.
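
P.S.: for reference, a simplified sketch of how the three XiveFabric
methods quoted above combine in the generic routing routine. The QOM
macro and the ive_*/eq_*/xive_* helpers are invented for the
illustration; the real code is in hw/intc/xive.c at the link given
earlier :

static void xive_fabric_route(XiveFabric *xf, uint32_t lisn)
{
    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);

    /* 1. IVT lookup : the IVE ties the source to an EQ and holds
     *    the event data to be pushed in the queue. */
    XiveIVE *ive = xfc->get_ive(xf, lisn);
    if (!ive || ive_is_masked(ive)) {
        return;
    }

    /* 2. EQDT lookup : the EQ descriptor gives the OS queue page
     *    and the server/priority target. */
    XiveEQ *eq = xfc->get_eq(xf, ive_eq_index(ive));
    if (!eq) {
        return;
    }

    /* 3. Push the event data in the OS event queue in RAM. */
    xive_eq_push(eq, ive_eq_data(ive));

    /* 4. VPDT lookup : find the NVT of the target server and signal
     *    the event through its TIMA registers. */
    XiveNVT *nvt = xfc->get_nvt(xf, eq_server(eq));
    if (nvt) {
        xive_nvt_notify(nvt, eq_priority(eq));
    }
}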