On 28.09.20 19:17, Chase Conklin wrote: > On 26.09.20 4:06, Jan Kiszka wrote: >> On 25.09.20 16:53, Jan Kiszka wrote: >>> On 25.09.20 16:44, Jan Kiszka wrote: >>>> On 25.09.20 16:21, Jan Kiszka wrote: >>>>> On 17.09.20 10:36, Oliver Schwartz wrote: >>>>>> >>>>>> >>>>>>> On 17 Sep 2020, at 09:31, Jan Kiszka <jan.kis...@siemens.com> wrote: >>>>>>> >>>>>>> On 17.09.20 09:16, Oliver Schwartz wrote: >>>>>>>>> On 15 Sep 2020, at 11:00, Jan Kiszka <jan.kis...@siemens.com >>>>>>>>> <mailto:jan.kis...@siemens.com>> wrote: >>>>>>>>> >>>>>>>>> On 15.09.20 09:07, Oliver Schwartz wrote: >>>>>>>>>> I’m currently trying out the arm64-zero-exits branch and got stuck. >>>>>>>>>> System is a Xilinx ZU9EG on a custom board, similar to zcu102. I’ve >>>>>>>>>> brought ATF up to date and patched it with Jans patch to enable >>>>>>>>>> SDEI. If I don’t enable SDEI in ATF everything works as expected >>>>>>>>>> (with VM exits for interrupts, of course). Jailhouse source is the >>>>>>>>>> tip of branch arm64-zero-exits. >>>>>>>>>> If I enable SDEI in ATF, jailhouse works most of the time, except >>>>>>>>>> for when it doesn’t. Sometimes, ‘jailhouse enable’ results in: >>>>>>>>>>> Initializing processors: >>>>>>>>>>> CPU 1... OK >>>>>>>>>>> CPU 0... >>>>>>>>>>> /home/oliver/0.12-gitAUTOINC+98061469d0-r0/git/hypervisor/arch/arm64/setup.c:73: >>>>>>>>>>> returning error -EIO >>>>>>>>> >>>>>>>>> Weird - that the SDEI event enable call. >>>>>>>>> >>>>>>>>>>> FAILED >>>>>>>>>>> JAILHOUSE_ENABLE: Input/output error >>>>>>>>>> I’ve seen this error only when I enable jailhouse through some init >>>>>>>>>> script during the boot process, when the system is also busy >>>>>>>>>> otherwise. When starting jailhouse on an idle system I haven’t seen >>>>>>>>>> this. >>>>>>>>> >>>>>>>>> Possibly a regression of my recent refactoring which I didn't manage >>>>>>>>> to test yet. Could you try if >>>>>>>>> >>>>>>>>> https://github.com/siemens/jailhouse/commits/e0ef829c85895dc6387d5ea11b08aa65a456255f >>>>>>>>> >>>>>>>>> was any better? >>>>>>>>> >>>>>>>>>> Sometimes it may hang later during ‘jailhouse enable’: >>>>>>>>>>> Initializing processors: >>>>>>>>>>> CPU 1... OK >>>>>>>>>>> CPU 0... OK >>>>>>>>>>> CPU 2... OK >>>>>>>>>>> CPU 3... OK >>>>>>>>>>> Initializing unit: irqchip >>>>>>>>>>> Using SDEI-based management interrupt >>>>>>>>>>> Initializing unit: ARM SMMU v3 >>>>>>>>>>> Initializing unit: PVU IOMMU >>>>>>>>>>> Initializing unit: PCI >>>>>>>>>>> Adding virtual PCI device 00:00.0 to cell "root" >>>>>>>>>>> Page pool usage after late setup: mem 67/992, remap 5/131072 >>>>>>>>>>> Activating hypervisor >>>>>>>>>>> [ 5.847540] The Jailhouse is opening. >>>>>>>>>> Using a JTAG debugger I see that one or more cores are stuck in >>>>>>>>>> hypervisor/arch/arm-common/psci.c, line 105. >>>>>>>>>> It may also succeed in stopping one or more CPUs and then hang >>>>>>>>>> (again with one or more cores stuck in psci.c, line 105): >>>>>>>>>>> [ 5.810220] The Jailhouse is opening. >>>>>>>>>>> [ 5.860054] CPU1: shutdown >>>>>>>>>>> [ 5.862677] psci: CPU1 killed. >>>>>>>> Now, with the first problem solved I’ve digged into the second one. >>>>>>>> It’s actually a bit worse than in my initial description: If I just do >>>>>>>> ‘jailhouse enable’ the system will always hang a few milliseconds >>>>>>>> after the command completes - the only exception is when ‘jailhouse >>>>>>>> create’ is executed immediately afterwards (which creates an inmate >>>>>>>> that uses 3 of 4 CPU cores, leaving just one for Linux), which >>>>>>>> succeeds roughly on every second try. I didn’t notice this initially >>>>>>>> because I usually start jailhouse with a script that does ‘enable’ and >>>>>>>> ‘create’. >>>>>>>> The reason for the hangs seems to be the psci emulation in Jailhouse, >>>>>>>> in particular the CPU_SUSPEND calls. These are issued from my >>>>>>>> (Xilinx-) kernel frequently if Linux has more than one core available. >>>>>>>> With SDEI disabled the core can be woken up again by some interrupt. >>>>>>>> With SDEI enabled, the core waits forever on the wfi intstruction. >>>>>>>> Because a suspended core never wakes up again the whole system hangs >>>>>>>> at some point. >>>>>>>> Any ideas why no interrupts are seen anymore in psci? My guess is that >>>>>>>> it’s because the inmate (Linux) now has full control over the GIC, so >>>>>>>> it may disable any interrupts before suspending a core, without >>>>>>>> Jailhouse noticing. If this is the case, it may be necessary to >>>>>>>> re-enable the IRQs before executing wfi. But I’m missing the big >>>>>>>> picture here - what interrupt is the core waiting for in the first >>>>>>>> place? Any other thoughts? >>>>>>> >>>>>>> You likely found a bug in the SDEI feature of Jailhouse. The >>>>>>> CPU_SUSPEND emulation assumes non-SDEI operation, i.e. interception of >>>>>>> interrupts by the hypervisor, but that is not true in this mode. >>>>>>> >>>>>>> We need a way to wait for interrupts without actually receiving them >>>>>>> when they arrive, but rather return to EL1 then. Maybe re-enabling >>>>>>> interception, waiting, and then disabling it again before returning >>>>>>> would do the trick. But then I also do not understand yet why >>>>>>> https://github.com/bao-project/bao-hypervisor/blob/master/src/arch/armv8/psci.c >>>>>>> gets away with wfi. Possibly, they run with interrupts on through the >>>>>>> hypervisor, though that would not be straightforward either. >>>>>> >>>>>> The good news is that there’s an easy workaround, at least on my system: >>>>>> disabling suspend calls before starting jailhouse >>>>>> (echo 1 > /sys/devices//system/cpu/cpu<n>/cpuidle/state1/disable). >>>>>> >>>>> >>>>> Seems the reason I was not seeing this so far is that my config [1] was >>>>> lacking CONFIG_ARM_PSCI_CPUIDLE. Seeing it now as well, let's debug. >>>>> >>>> >>>> My ideas seems to work (quick hack): >>>> >>>> diff --git a/hypervisor/arch/arm-common/psci.c >>>> b/hypervisor/arch/arm-common/psci.c >>>> index 6a9abf60..3bb3f6a8 100644 >>>> --- a/hypervisor/arch/arm-common/psci.c >>>> +++ b/hypervisor/arch/arm-common/psci.c >>>> @@ -101,6 +101,14 @@ long psci_dispatch(struct trap_context *ctx) >>>> >>>> case PSCI_0_2_FN_CPU_SUSPEND: >>>> case PSCI_0_2_FN64_CPU_SUSPEND: >>>> +if (sdei_available) { >>>> +unsigned long hcr; >>>> +arm_read_sysreg(HCR_EL2, hcr); >>>> +arm_write_sysreg(HCR_EL2, >>>> + hcr | HCR_IMO_BIT | HCR_FMO_BIT); >>>> +asm volatile("wfi" : : : "memory"); >>>> +arm_read_sysreg(HCR_EL2, hcr); >> ^^^^^^ >> Argh... >> >>>> +} else >>>> if (!irqchip_has_pending_irqs()) { >>>> asm volatile("wfi" : : : "memory"); >>>> irqchip_handle_irq(); >>>> >>>> Now, if someone with more architectural knowledge than I could explain >>>> why that's the case and if that will work on all platforms, with both >>>> GICv2 and v3 (and maybe even v4), we could convert that into real patch. >>>> Trying my luck on the CC list... >>>> >>> >>> Nää, I was too quick: wfi works, i.e. the hypervisor is woken up on >>> pending interrupts, but some more bits than simply clearing IMO/FMO in >>> HCR are needed in order to forward that pending irq event to EL1 when >>> returning to it. >>> >> >> With the above fixed to 'write', it now works fine here. >> >> While trying to understand what goes wrong, I also read what happens in >> the IMO/FMO-disabled case when an interrupt arrives while in EL2: It >> does not count as wakeup event for wfi. Only if IMO/FMO are set (or >> TGE), we are kicked out of wfi. And that's actually independent of the >> GIC model. > > Hi Jan, > > That matches my understanding. > >> What's once again unclear to me is if/where I need isb. Before the wfi? >> Also after restoring HCR_EL2? > > Definitely before the wfi. The architecture doesn't guarantee that the > write to HCR_EL2 takes effect prior to a context synchronization event, > so it's architecturally permissible (though perhaps unlikely in > practice) for the wfi to not treat the interrupt as a wfi wake-up event > without an isb between the HCR_EL2 write and wfi. > > As it's currently written, I don't think you'll need an isb after the > wfi. Jailhouse never unmasks interrupts, so there's no risk of an > unexpected Current-EL IRQ/FIQ. The return path to the guest doesn't > include any other manipulation of interrupts. As a result, Jailhouse > isn't affected by the setting of IMO/FMO prior to returning to the > guest; the context synchronization event that takes place as part of > the eret is sufficient to ensure the clearing of IMO/FMO take effect for > the guest. Be sure to revisit this code if you choose to make use of the > ExS feature from v8.5. >
Thanks for looking into this! I already had a isb before the wfi. Will send out the patches soon then. Jan -- You received this message because you are subscribed to the Google Groups "Jailhouse" group. To unsubscribe from this group and stop receiving emails from it, send an email to jailhouse-dev+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/jailhouse-dev/9508729a-1756-e63a-d437-9d491f822f06%40web.de.