On 25.09.20 16:21, Jan Kiszka wrote:
> On 17.09.20 10:36, Oliver Schwartz wrote:
>>
>>
>>> On 17 Sep 2020, at 09:31, Jan Kiszka <jan.kis...@siemens.com> wrote:
>>>
>>> On 17.09.20 09:16, Oliver Schwartz wrote:
>>>>> On 15 Sep 2020, at 11:00, Jan Kiszka <jan.kis...@siemens.com> wrote:
>>>>>
>>>>> On 15.09.20 09:07, Oliver Schwartz wrote:
>>>>>> I’m currently trying out the arm64-zero-exits branch and got stuck.
>>>>>> System is a Xilinx ZU9EG on a custom board, similar to zcu102. I’ve 
>>>>>> brought ATF up to date and patched it with Jan's patch to enable SDEI. If 
>>>>>> I don’t enable SDEI in ATF everything works as expected (with VM exits 
>>>>>> for interrupts, of course). Jailhouse source is the tip of branch 
>>>>>> arm64-zero-exits.
>>>>>> If I enable SDEI in ATF, jailhouse works most of the time, except for 
>>>>>> when it doesn’t. Sometimes, ‘jailhouse enable’ results in:
>>>>>>> Initializing processors:
>>>>>>>  CPU 1... OK
>>>>>>>  CPU 0... 
>>>>>>> /home/oliver/0.12-gitAUTOINC+98061469d0-r0/git/hypervisor/arch/arm64/setup.c:73:
>>>>>>>  returning error -EIO
>>>>>
>>>>> Weird - that's the SDEI event enable call.
>>>>>
>>>>>>> FAILED
>>>>>>> JAILHOUSE_ENABLE: Input/output error
>>>>>> I’ve seen this error only when I enable jailhouse through some init 
>>>>>> script during the boot process, when the system is also busy otherwise. 
>>>>>> When starting jailhouse on an idle system I haven’t seen this.
>>>>>
>>>>> Possibly a regression of my recent refactoring which I didn't manage to 
>>>>> test yet. Could you try if
>>>>>
>>>>> https://github.com/siemens/jailhouse/commits/e0ef829c85895dc6387d5ea11b08aa65a456255f
>>>>>
>>>>> was any better?
>>>>>
>>>>>> Sometimes it may hang later during ‘jailhouse enable’:
>>>>>>> Initializing processors:
>>>>>>>  CPU 1... OK
>>>>>>>  CPU 0... OK
>>>>>>>  CPU 2... OK
>>>>>>>  CPU 3... OK
>>>>>>> Initializing unit: irqchip
>>>>>>> Using SDEI-based management interrupt
>>>>>>> Initializing unit: ARM SMMU v3
>>>>>>> Initializing unit: PVU IOMMU
>>>>>>> Initializing unit: PCI
>>>>>>> Adding virtual PCI device 00:00.0 to cell "root"
>>>>>>> Page pool usage after late setup: mem 67/992, remap 5/131072
>>>>>>> Activating hypervisor
>>>>>>> [    5.847540] The Jailhouse is opening.
>>>>>> Using a JTAG debugger I see that one or more cores are stuck in 
>>>>>> hypervisor/arch/arm-common/psci.c, line 105.
>>>>>> It may also succeed in stopping one or more CPUs and then hang (again 
>>>>>> with one or more cores stuck in psci.c, line 105):
>>>>>>> [    5.810220] The Jailhouse is opening.
>>>>>>> [    5.860054] CPU1: shutdown
>>>>>>> [    5.862677] psci: CPU1 killed.
>>>> Now, with the first problem solved I’ve dug into the second one. It’s 
>>>> actually a bit worse than in my initial description: If I just do 
>>>> ‘jailhouse enable’ the system will always hang a few milliseconds after 
>>>> the command completes - the only exception is when ‘jailhouse create’ is 
>>>> executed immediately afterwards (which creates an inmate that uses 3 of 4 
>>>> CPU cores, leaving just one for Linux), which succeeds roughly on every 
>>>> second try. I didn’t notice this initially because I usually start 
>>>> jailhouse with a script that does ‘enable’ and ‘create’.
>>>> The reason for the hangs seems to be the psci emulation in Jailhouse, in 
>>>> particular the CPU_SUSPEND calls. These are issued from my (Xilinx-) 
>>>> kernel frequently if Linux has more than one core available. With SDEI 
>>>> disabled the core can be woken up again by some interrupt. With SDEI 
>>>> enabled, the core waits forever on the wfi instruction. Because a 
>>>> suspended core never wakes up again the whole system hangs at some point.
>>>> Any ideas why no interrupts are seen anymore in psci? My guess is that 
>>>> it’s because the inmate (Linux) now has full control over the GIC, so it 
>>>> may disable any interrupts before suspending a core, without Jailhouse 
>>>> noticing. If this is the case, it may be necessary to re-enable the IRQs 
>>>> before executing wfi. But I’m missing the big picture here - what 
>>>> interrupt is the core waiting for in the first place? Any other thoughts?
>>>
>>> You likely found a bug in the SDEI feature of Jailhouse. The CPU_SUSPEND 
>>> emulation assumes non-SDEI operation, i.e. interception of interrupts by 
>>> the hypervisor, but that is not true in this mode.
>>>
>>> We need a way to wait for interrupts without actually receiving them when 
>>> they arrive, but rather return to EL1 then. Maybe re-enabling interception, 
>>> waiting, and then disabling it again before returning would do the trick. 
>>> But then I also do not understand yet why 
>>> https://github.com/bao-project/bao-hypervisor/blob/master/src/arch/armv8/psci.c
>>>  gets away with wfi. Possibly, they run with interrupts on through the 
>>> hypervisor, though that would not be straightforward either.
>>
>> The good news is that there’s an easy workaround, at least on my system: 
>> disabling suspend calls before starting jailhouse 
>> (echo 1 > /sys/devices/system/cpu/cpu<n>/cpuidle/state1/disable).
>>
> 
> Seems the reason I was not seeing this so far is that my config [1] was
> lacking CONFIG_ARM_PSCI_CPUIDLE. Seeing it now as well, let's debug.
> 

My idea seems to work (quick hack):

diff --git a/hypervisor/arch/arm-common/psci.c b/hypervisor/arch/arm-common/psci.c
index 6a9abf60..3bb3f6a8 100644
--- a/hypervisor/arch/arm-common/psci.c
+++ b/hypervisor/arch/arm-common/psci.c
@@ -101,6 +101,14 @@ long psci_dispatch(struct trap_context *ctx)
 
        case PSCI_0_2_FN_CPU_SUSPEND:
        case PSCI_0_2_FN64_CPU_SUSPEND:
+               if (sdei_available) {
+                       unsigned long hcr;
+                       arm_read_sysreg(HCR_EL2, hcr);
+                       arm_write_sysreg(HCR_EL2,
+                                        hcr | HCR_IMO_BIT | HCR_FMO_BIT);
+                       asm volatile("wfi" : : : "memory");
+                       arm_write_sysreg(HCR_EL2, hcr);
+               } else
                if (!irqchip_has_pending_irqs()) {
                        asm volatile("wfi" : : : "memory");
                        irqchip_handle_irq();

Now, if someone with more architectural knowledge than I have could explain 
why that's the case and whether that will work on all platforms, with both 
GICv2 and v3 (and maybe even v4), we could convert that into a real patch.
Trying my luck on the CC list...
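
For readers following along, here is a minimal host-side sketch of the
pattern the hack appears to aim for: save HCR_EL2, set IMO and FMO only for
the duration of the wait, then put the saved value back. It is a model, not
hypervisor code. The register is a plain variable, and fake_hcr_el2,
model_read_hcr(), model_write_hcr(), model_wfi() and
model_suspend_with_sdei() are hypothetical names introduced for illustration
only. HCR_FMO_BIT and HCR_IMO_BIT use the architectural HCR_EL2 bit
positions (3 and 4). Restoring the saved value after the wait is an
assumption about the intent of the hack.

```c
#include <stdint.h>

/* Architectural bit positions of HCR_EL2.FMO and HCR_EL2.IMO. */
#define HCR_FMO_BIT	(1ULL << 3)
#define HCR_IMO_BIT	(1ULL << 4)

/* Hypothetical stand-in for the HCR_EL2 system register (host model only). */
static uint64_t fake_hcr_el2;

/* Records the register value that was live while "waiting". */
static uint64_t hcr_seen_during_wait;

static uint64_t model_read_hcr(void)
{
	return fake_hcr_el2;
}

static void model_write_hcr(uint64_t val)
{
	fake_hcr_el2 = val;
}

/* Stand-in for the wfi instruction: just snapshot the routing state. */
static void model_wfi(void)
{
	hcr_seen_during_wait = model_read_hcr();
}

/*
 * Assumed intent of the hack: route IRQ and FIQ to EL2 for the wait only,
 * then restore the routing that was in place before.
 */
static void model_suspend_with_sdei(void)
{
	uint64_t hcr = model_read_hcr();

	model_write_hcr(hcr | HCR_IMO_BIT | HCR_FMO_BIT);
	model_wfi();
	model_write_hcr(hcr);
}
```

Calling model_suspend_with_sdei() with fake_hcr_el2 preset to some value
leaves that value unchanged afterwards, while hcr_seen_during_wait shows
both routing bits set. Whether the real hardware wakes from wfi in that
state on every GIC version is exactly the open question above. The model
says nothing about it.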

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
