On 12/12/18 1:19 AM, David Gibson wrote:
> On Tue, Dec 11, 2018 at 10:06:46AM +0100, Cédric Le Goater wrote:
>> On 12/11/18 1:38 AM, David Gibson wrote:
>>> On Mon, Dec 10, 2018 at 08:53:17AM +0100, Cédric Le Goater wrote:
>>>> On 12/10/18 7:39 AM, David Gibson wrote:
>>>>> On Sun, Dec 09, 2018 at 08:46:00PM +0100, Cédric Le Goater wrote:
>>>>>> The XIVE interface for the guest is described in the device tree under
>>>>>> the "interrupt-controller" node. A couple of new properties are
>>>>>> specific to XIVE :
>>>>>>
>>>>>>  - "reg"
>>>>>>
>>>>>>    contains the base address and size of the thread interrupt
>>>>>>    management areas (TIMA), for the User level and for the Guest OS
>>>>>>    level. Only the Guest OS level is taken into account today.
>>>>>>
>>>>>>  - "ibm,xive-eq-sizes"
>>>>>>
>>>>>>    the size of the event queues. One cell per size supported, contains
>>>>>>    log2 of size, in ascending order.
>>>>>>
>>>>>>  - "ibm,xive-lisn-ranges"
>>>>>>
>>>>>>    the IRQ interrupt number ranges assigned to the guest for the IPIs.
>>>>>>
>>>>>> and also under the root node :
>>>>>>
>>>>>>  - "ibm,plat-res-int-priorities"
>>>>>>
>>>>>>    contains a list of priorities that the hypervisor has reserved for
>>>>>>    its own use. OPAL uses the priority 7 queue to automatically
>>>>>>    escalate interrupts for all other queues (DD2.X POWER9). So only
>>>>>>    priorities [0..6] are allowed for the guest.
>>>>>>
>>>>>> Extend the sPAPR IRQ backend with a new handler to populate the DT
>>>>>> with the appropriate "interrupt-controller" node.
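[ Editor's note: for reference, the resulting guest device tree could look roughly like the sketch below. This is illustrative only -- the node name, unit address and "reg" values depend on the machine; the cell values shown are taken from the arrays built by the patch. ]

```dts
/* illustrative fragment, not generated output */
interrupt-controller@... {
        /* User-level and OS TIMA: (address, size) pairs */
        reg = < /* user TIMA addr/size, OS TIMA addr/size */ >;
        /* log2 of the supported EQ sizes: 4K, 64K, 2M, 16M */
        ibm,xive-eq-sizes = <12 16 21 24>;
        /* IPI number range assigned to the guest: [0, nr_servers) */
        ibm,xive-lisn-ranges = <0 /* nr_servers */>;
};

/* and under the root node: priorities 7..0xfe reserved for the hypervisor */
ibm,plat-res-int-priorities = <7 0xf8>;
```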
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <c...@kaod.org>
>>>>>> ---
>>>>>>  include/hw/ppc/spapr_irq.h  |  2 ++
>>>>>>  include/hw/ppc/spapr_xive.h |  2 ++
>>>>>>  include/hw/ppc/xics.h       |  4 +--
>>>>>>  hw/intc/spapr_xive.c        | 64 +++++++++++++++++++++++++++++++++++++
>>>>>>  hw/intc/xics_spapr.c        |  3 +-
>>>>>>  hw/ppc/spapr.c              |  3 +-
>>>>>>  hw/ppc/spapr_irq.c          |  3 ++
>>>>>>  7 files changed, 77 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
>>>>>> index 23cdb51b879e..e51e9f052f63 100644
>>>>>> --- a/include/hw/ppc/spapr_irq.h
>>>>>> +++ b/include/hw/ppc/spapr_irq.h
>>>>>> @@ -39,6 +39,8 @@ typedef struct sPAPRIrq {
>>>>>>      void (*free)(sPAPRMachineState *spapr, int irq, int num);
>>>>>>      qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
>>>>>>      void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
>>>>>> +    void (*dt_populate)(sPAPRMachineState *spapr, uint32_t nr_servers,
>>>>>> +                        void *fdt, uint32_t phandle);
>>>>>>  } sPAPRIrq;
>>>>>>
>>>>>>  extern sPAPRIrq spapr_irq_xics;
>>>>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>>>>>> index 9506a8f4d10a..728a5e8dc163 100644
>>>>>> --- a/include/hw/ppc/spapr_xive.h
>>>>>> +++ b/include/hw/ppc/spapr_xive.h
>>>>>> @@ -45,5 +45,7 @@ qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
>>>>>>  typedef struct sPAPRMachineState sPAPRMachineState;
>>>>>>
>>>>>>  void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>>>>>> +void spapr_dt_xive(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
>>>>>> +                   uint32_t phandle);
>>>>>>
>>>>>>  #endif /* PPC_SPAPR_XIVE_H */
>>>>>> diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
>>>>>> index 9958443d1984..14afda198cdb 100644
>>>>>> --- a/include/hw/ppc/xics.h
>>>>>> +++ b/include/hw/ppc/xics.h
>>>>>> @@ -181,8 +181,6 @@ typedef struct XICSFabricClass {
>>>>>>      ICPState *(*icp_get)(XICSFabric *xi, int server);
>>>>>>  } XICSFabricClass;
>>>>>>
>>>>>> -void spapr_dt_xics(int nr_servers, void *fdt, uint32_t phandle);
>>>>>> -
>>>>>>  ICPState *xics_icp_get(XICSFabric *xi, int server);
>>>>>>
>>>>>>  /* Internal XICS interfaces */
>>>>>> @@ -204,6 +202,8 @@ void icp_resend(ICPState *ss);
>>>>>>
>>>>>>  typedef struct sPAPRMachineState sPAPRMachineState;
>>>>>>
>>>>>> +void spapr_dt_xics(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
>>>>>> +                   uint32_t phandle);
>>>>>>  int xics_kvm_init(sPAPRMachineState *spapr, Error **errp);
>>>>>>  void xics_spapr_init(sPAPRMachineState *spapr);
>>>>>>
>>>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>>>> index 982ac6e17051..a6d854b07690 100644
>>>>>> --- a/hw/intc/spapr_xive.c
>>>>>> +++ b/hw/intc/spapr_xive.c
>>>>>> @@ -14,6 +14,7 @@
>>>>>>  #include "target/ppc/cpu.h"
>>>>>>  #include "sysemu/cpus.h"
>>>>>>  #include "monitor/monitor.h"
>>>>>> +#include "hw/ppc/fdt.h"
>>>>>>  #include "hw/ppc/spapr.h"
>>>>>>  #include "hw/ppc/spapr_xive.h"
>>>>>>  #include "hw/ppc/xive.h"
>>>>>> @@ -1381,3 +1382,66 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr)
>>>>>>      spapr_register_hypercall(H_INT_SYNC, h_int_sync);
>>>>>>      spapr_register_hypercall(H_INT_RESET, h_int_reset);
>>>>>>  }
>>>>>> +
>>>>>> +void spapr_dt_xive(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
>>>>>> +                   uint32_t phandle)
>>>>>> +{
>>>>>> +    sPAPRXive *xive = spapr->xive;
>>>>>> +    int node;
>>>>>> +    uint64_t timas[2 * 2];
>>>>>> +    /* Interrupt number ranges for the IPIs */
>>>>>> +    uint32_t lisn_ranges[] = {
>>>>>> +        cpu_to_be32(0),
>>>>>> +        cpu_to_be32(nr_servers),
>>>>>> +    };
>>>>>> +    uint32_t eq_sizes[] = {
>>>>>> +        cpu_to_be32(12), /* 4K */
>>>>>> +        cpu_to_be32(16), /* 64K */
>>>>>> +        cpu_to_be32(21), /* 2M */
>>>>>> +        cpu_to_be32(24), /* 16M */
>>>>>
>>>>> For KVM, are we going to need to clamp this list based on the
>>>>> pagesizes the guest can use?
>>>>
>>>> I would say so. Is there a KVM service for that ?
>>>
>>> I'm not sure what you mean by a KVM service here.
>>
>> I meant a routine giving a list of the possible backing pagesizes.
>
> I'm still not really sure what you mean by that.
>
>>>> Today, the OS scans the list and picks the size fitting its current
>>>> PAGE_SIZE/SHIFT. But I suppose it would be better to advertise
>>>> only the page sizes that QEMU/KVM supports. Or should we play safe
>>>> and only export : 4K, 64K ?
>>>
>>> So, I'm guessing the limitation here is that when we use the KVM XIVE
>>> acceleration, the EQ will need to be host-contiguous?
>>
>> The EQ should be a single page (from the specs). So we don't have
>> a contiguity problem.
>
> A single page according to whom, though? A RPT guest can use 2MiB
> pages, even if it is backed with smaller pages on the host, but I'm
> guessing it would not work to have its EQ be a 2MiB page, since it
> won't be host-contiguous.
ok. yes. the HW won't be able to handle that. The specs are not very
clear on that topic. The Qsize field is 4 bits, the page size being
2^(n+12), so that allows a lot of different page sizes, but the
implementation only refers to 4K and 64K.

C.

>>> So, we can't have EQs larger than the backing pagesize for guest
>>> memory.
>>>
>>> We try to avoid having guest visible differences based on whether
>>> we're using KVM or not, so we don't want to change this list depending
>>> on whether KVM is active etc. - or on whether we're backing the guest
>>> with hugepages.
>>>
>>> We could reuse the cap-hpt-max-page-size machine parameter which also
>>> relates to backing page size, but that might be confusing since it's
>>> explicitly about HPT only whereas this limitation would apply to RPT
>>> guests too, IIUC.
>>>
>>> I think our best bet is probably to limit to 64kiB queues
>>> unconditionally.
>>
>> OK. That's the default. We can refine this list of page sizes later on
>> when we introduce KVM support if needed.
>>
>>> AIUI, that still gives us 8192 slots in the queue,
>>
>> The entries are 32bits. So that's 16384 entries.
>
> Even better.
>
>>> which is enough to store one of every possible irq, which should be
>>> enough.
>>
>> yes.
>>
>> I don't think Linux measures how many entries are dequeued at once.
>> It would give us a feel of how much pressure these EQs can sustain.
>>
>> C.
>>
>>> There's still an issue if we have a 4kiB pagesize host kernel, but we
>>> could just disable kernel_irqchip in that situation, since we don't
>>> expect people to be using that much in practice.
>>>
>>>>>> +    };
>>>>>> +    /* The following array is in sync with the reserved priorities
>>>>>> +     * defined by the 'spapr_xive_priority_is_reserved' routine.
>>>>>> +     */
>>>>>> +    uint32_t plat_res_int_priorities[] = {
>>>>>> +        cpu_to_be32(7),    /* start */
>>>>>> +        cpu_to_be32(0xf8), /* count */
>>>>>> +    };
>>>>>> +    gchar *nodename;
>>>>>> +
>>>>>> +    /* Thread Interrupt Management Area : User (ring 3) and OS (ring 2) */
>>>>>> +    timas[0] = cpu_to_be64(xive->tm_base +
>>>>>> +                           XIVE_TM_USER_PAGE * (1ull << TM_SHIFT));
>>>>>> +    timas[1] = cpu_to_be64(1ull << TM_SHIFT);
>>>>>> +    timas[2] = cpu_to_be64(xive->tm_base +
>>>>>> +                           XIVE_TM_OS_PAGE * (1ull << TM_SHIFT));
>>>>>> +    timas[3] = cpu_to_be64(1ull << TM_SHIFT);
>>>>>
>>>>> So this gives the MMIO address of the TIMA.
>>>>
>>>> Yes. It is considered to be a fixed "address"
>>>>
>>>>> Where does the guest get the MMIO address for the ESB pages from?
>>>>
>>>> with the hcall H_INT_GET_SOURCE_INFO.
>>>
>>> Ah, right.
>>>
>>
>