On 28.09.20 19:17, Chase Conklin wrote:
> On 26.09.20 4:06, Jan Kiszka wrote:
>> On 25.09.20 16:53, Jan Kiszka wrote:
>>> On 25.09.20 16:44, Jan Kiszka wrote:
>>>> On 25.09.20 16:21, Jan Kiszka wrote:
>>>>> On 17.09.20 10:36, Oliver Schwartz wrote:
>>>>>>
>>>>>>
>>>>>>> On 17 Sep 2020, at 09:31, Jan Kiszka <jan.kis...@siemens.com> wrote:
>>>>>>>
>>>>>>> On 17.09.20 09:16, Oliver Schwartz wrote:
>>>>>>>>> On 15 Sep 2020, at 11:00, Jan Kiszka <jan.kis...@siemens.com 
>>>>>>>>> <mailto:jan.kis...@siemens.com>> wrote:
>>>>>>>>>
>>>>>>>>> On 15.09.20 09:07, Oliver Schwartz wrote:
>>>>>>>>>> I’m currently trying out the arm64-zero-exits branch and got stuck.
>>>>>>>>>> System is a Xilinx ZU9EG on a custom board, similar to zcu102. I’ve 
>>>>>>>>>> brought ATF up to date and patched it with Jans patch to enable 
>>>>>>>>>> SDEI. If I don’t enable SDEI in ATF everything works as expected 
>>>>>>>>>> (with VM exits for interrupts, of course). Jailhouse source is the 
>>>>>>>>>> tip of branch arm64-zero-exits.
>>>>>>>>>> If I enable SDEI in ATF, jailhouse works most of the time, except 
>>>>>>>>>> for when it doesn’t. Sometimes, ‘jailhouse enable’ results in:
>>>>>>>>>>> Initializing processors:
>>>>>>>>>>>  CPU 1... OK
>>>>>>>>>>>  CPU 0... 
>>>>>>>>>>> /home/oliver/0.12-gitAUTOINC+98061469d0-r0/git/hypervisor/arch/arm64/setup.c:73:
>>>>>>>>>>>  returning error -EIO
>>>>>>>>>
>>>>>>>>> Weird - that the SDEI event enable call.
>>>>>>>>>
>>>>>>>>>>> FAILED
>>>>>>>>>>> JAILHOUSE_ENABLE: Input/output error
>>>>>>>>>> I’ve seen this error only when I enable jailhouse through some init 
>>>>>>>>>> script during the boot process, when the system is also busy 
>>>>>>>>>> otherwise. When starting jailhouse on an idle system I haven’t seen 
>>>>>>>>>> this.
>>>>>>>>>
>>>>>>>>> Possibly a regression of my recent refactoring which I didn't manage 
>>>>>>>>> to test yet. Could you try if
>>>>>>>>>
>>>>>>>>> https://github.com/siemens/jailhouse/commits/e0ef829c85895dc6387d5ea11b08aa65a456255f
>>>>>>>>>
>>>>>>>>> was any better?
>>>>>>>>>
>>>>>>>>>> Sometimes it may hang later during ‘jailhouse enable’:
>>>>>>>>>>> Initializing processors:
>>>>>>>>>>>  CPU 1... OK
>>>>>>>>>>>  CPU 0... OK
>>>>>>>>>>>  CPU 2... OK
>>>>>>>>>>>  CPU 3... OK
>>>>>>>>>>> Initializing unit: irqchip
>>>>>>>>>>> Using SDEI-based management interrupt
>>>>>>>>>>> Initializing unit: ARM SMMU v3
>>>>>>>>>>> Initializing unit: PVU IOMMU
>>>>>>>>>>> Initializing unit: PCI
>>>>>>>>>>> Adding virtual PCI device 00:00.0 to cell "root"
>>>>>>>>>>> Page pool usage after late setup: mem 67/992, remap 5/131072
>>>>>>>>>>> Activating hypervisor
>>>>>>>>>>> [    5.847540] The Jailhouse is opening.
>>>>>>>>>> Using a JTAG debugger I see that one or more cores are stuck in 
>>>>>>>>>> hypervisor/arch/arm-common/psci.c, line 105.
>>>>>>>>>> It may also succeed in stopping one or more CPUs and then hang 
>>>>>>>>>> (again with one or more cores stuck in psci.c, line 105):
>>>>>>>>>>> [    5.810220] The Jailhouse is opening.
>>>>>>>>>>> [    5.860054] CPU1: shutdown
>>>>>>>>>>> [    5.862677] psci: CPU1 killed.
>>>>>>>> Now, with the first problem solved I’ve digged into the second one. 
>>>>>>>> It’s actually a bit worse than in my initial description: If I just do 
>>>>>>>> ‘jailhouse enable’ the system will always hang a few milliseconds 
>>>>>>>> after the command completes - the only exception is when ‘jailhouse 
>>>>>>>> create’ is executed immediately afterwards (which creates an inmate 
>>>>>>>> that uses 3 of 4 CPU cores, leaving just one for Linux), which 
>>>>>>>> succeeds roughly on every second try. I didn’t notice this initially 
>>>>>>>> because I usually start jailhouse with a script that does ‘enable’ and 
>>>>>>>> ‘create’.
>>>>>>>> The reason for the hangs seems to be the psci emulation in Jailhouse, 
>>>>>>>> in particular the CPU_SUSPEND calls. These are issued from my 
>>>>>>>> (Xilinx-) kernel frequently if Linux has more than one core available. 
>>>>>>>> With SDEI disabled the core can be woken up again by some interrupt. 
>>>>>>>> With SDEI enabled, the core waits forever on the wfi intstruction. 
>>>>>>>> Because a suspended core never wakes up again the whole system hangs 
>>>>>>>> at some point.
>>>>>>>> Any ideas why no interrupts are seen anymore in psci? My guess is that 
>>>>>>>> it’s because the inmate (Linux) now has full control over the GIC, so 
>>>>>>>> it may disable any interrupts before suspending a core, without 
>>>>>>>> Jailhouse noticing. If this is the case, it may be necessary to 
>>>>>>>> re-enable the IRQs before executing wfi. But I’m missing the big 
>>>>>>>> picture here - what interrupt is the core waiting for in the first 
>>>>>>>> place? Any other thoughts?
>>>>>>>
>>>>>>> You likely found a bug in the SDEI feature of Jailhouse. The 
>>>>>>> CPU_SUSPEND emulation assumes non-SDEI operation, i.e. interception of 
>>>>>>> interrupts by the hypervisor, but that is not true in this mode.
>>>>>>>
>>>>>>> We need a way to wait for interrupts without actually receiving them 
>>>>>>> when they arrive, but rather return to EL1 then. Maybe re-enabling 
>>>>>>> interception, waiting, and then disabling it again before returning 
>>>>>>> would do the trick. But then I also do not understand yet why 
>>>>>>> https://github.com/bao-project/bao-hypervisor/blob/master/src/arch/armv8/psci.c
>>>>>>>  gets away with wfi. Possibly, they run with interrupts on through the 
>>>>>>> hypervisor, though that would not be straightforward either.
>>>>>>
>>>>>> The good news is that there’s an easy workaround, at least on my system: 
>>>>>> disabling suspend calls before starting jailhouse
>>>>>> (echo 1 >  /sys/devices//system/cpu/cpu<n>/cpuidle/state1/disable).
>>>>>>
>>>>>
>>>>> Seems the reason I was not seeing this so far is that my config [1] was
>>>>> lacking CONFIG_ARM_PSCI_CPUIDLE. Seeing it now as well, let's debug.
>>>>>
>>>>
>>>> My ideas seems to work (quick hack):
>>>>
>>>> diff --git a/hypervisor/arch/arm-common/psci.c 
>>>> b/hypervisor/arch/arm-common/psci.c
>>>> index 6a9abf60..3bb3f6a8 100644
>>>> --- a/hypervisor/arch/arm-common/psci.c
>>>> +++ b/hypervisor/arch/arm-common/psci.c
>>>> @@ -101,6 +101,14 @@ long psci_dispatch(struct trap_context *ctx)
>>>>
>>>>  case PSCI_0_2_FN_CPU_SUSPEND:
>>>>  case PSCI_0_2_FN64_CPU_SUSPEND:
>>>> +if (sdei_available) {
>>>> +unsigned long hcr;
>>>> +arm_read_sysreg(HCR_EL2, hcr);
>>>> +arm_write_sysreg(HCR_EL2,
>>>> + hcr | HCR_IMO_BIT | HCR_FMO_BIT);
>>>> +asm volatile("wfi" : : : "memory");
>>>> +arm_read_sysreg(HCR_EL2, hcr);
>>                           ^^^^^^
>> Argh...
>>
>>>> +} else
>>>>  if (!irqchip_has_pending_irqs()) {
>>>>  asm volatile("wfi" : : : "memory");
>>>>  irqchip_handle_irq();
>>>>
>>>> Now, if someone with more architectural knowledge than I could explain
>>>> why that's the case and if that will work on all platforms, with both
>>>> GICv2 and v3 (and maybe even v4), we could convert that into real patch.
>>>> Trying my luck on the CC list...
>>>>
>>>
>>> Nää, I was too quick: wfi works, i.e. the hypervisor is woken up on
>>> pending interrupts, but some more bits than simply clearing IMO/FMO in
>>> HCR are needed in order to forward that pending irq event to EL1 when
>>> returning to it.
>>>
>>
>> With the above fixed to 'write', it now works fine here.
>>
>> While trying to understand what goes wrong, I also read what happens in
>> the IMO/FMO-disabled case when an interrupt arrives while in EL2: It
>> does not count as wakeup event for wfi. Only if IMO/FMO are set (or
>> TGE), we are kicked out of wfi. And that's actually independent of the
>> GIC model.
>
> Hi Jan,
>
> That matches my understanding.
>
>> What's once again unclear to me is if/where I need isb. Before the wfi?
>> Also after restoring HCR_EL2?
>
> Definitely before the wfi. The architecture doesn't guarantee that the
> write to HCR_EL2 takes effect prior to a context synchronization event,
> so it's architecturally permissible (though perhaps unlikely in
> practice) for the wfi to not treat the interrupt as a wfi wake-up event
> without an isb between the HCR_EL2 write and wfi.
>
> As it's currently written, I don't think you'll need an isb after the
> wfi. Jailhouse never unmasks interrupts, so there's no risk of an
> unexpected Current-EL IRQ/FIQ. The return path to the guest doesn't
> include any other manipulation of interrupts. As a result, Jailhouse
> isn't affected by the setting of IMO/FMO prior to returning to the
> guest; the context synchronization event that takes place as part of
> the eret is sufficient to ensure the clearing of IMO/FMO take effect for
> the guest. Be sure to revisit this code if you choose to make use of the
> ExS feature from v8.5.
>

Thanks for looking into this! I already had a isb before the wfi. Will
send out the patches soon then.

Jan

-- 
You received this message because you are subscribed to the Google Groups 
"Jailhouse" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to jailhouse-dev+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/jailhouse-dev/9508729a-1756-e63a-d437-9d491f822f06%40web.de.

Reply via email to