Hi John,

> On 25 Mar 2024, at 13:18, John Ernberg <john.ernb...@actia.se> wrote:
> 
> Hi Bertrand,
> 
> On 3/21/24 17:15, Bertrand Marquis wrote:
>> Hi John,
>> 
>>> On 21 Mar 2024, at 17:05, John Ernberg <john.ernb...@actia.se> wrote:
>>> 
>>> Hi Bertrand,
>>> 
>>> On 3/21/24 08:41, Bertrand Marquis wrote:
>>>> Hi,
>>>> 
>>>>> On 20 Mar 2024, at 18:40, Julien Grall <jul...@xen.org> wrote:
>>>>> 
>>>>> Hi John,
>>>>> 
>>>>> On 20/03/2024 16:24, John Ernberg wrote:
>>>>>> Hi Bertrand,
>>>>>> On 3/13/24 11:07, Bertrand Marquis wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>>> On 8 Mar 2024, at 15:04, Julien Grall <jul...@xen.org> wrote:
>>>>>>>> 
>>>>>>>> Hi John,
>>>>>>>> 
>>>>>>>> Thank you for the reply.
>>>>>>>> 
>>>>>>>> On 08/03/2024 13:40, John Ernberg wrote:
>>>>>>>>> On 3/7/24 00:07, Julien Grall wrote:
>>>>>>>>>>> Ping on the watchdog discussion bits.
>>>>>>>>>> 
>>>>>>>>>> Sorry for the late reply.
>>>>>>>>>> 
>>>>>>>>>> On 06/03/2024 13:13, John Ernberg wrote:
>>>>>>>>>>> On 2/9/24 14:14, John Ernberg wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>>      * IMX_SIP_TIMER_*:  This seems to be related to the watchdog.
>>>>>>>>>>>>> Shouldn't dom0 rely on the watchdog provided by Xen instead? So 
>>>>>>>>>>>>> those
>>>>>>>>>>>>> call will be used by Xen.
>>>>>>>>>>>> 
>>>>>>>>>>>> That is indeed a watchdog SIP, and also for setting the SoC 
>>>>>>>>>>>> internal RTC
>>>>>>>>>>>> if it is being used.
>>>>>>>>>>>> 
>>>>>>>>>>>> I looked around if there was previous discussion and only really
>>>>>>>>>>>> found [3].
>>>>>>>>>>>> Is the xen/xen/include/watchdog.h header meant to be for this kind 
>>>>>>>>>>>> of
>>>>>>>>>>>> watchdog support or is that more for the VM watchdog? Looking at 
>>>>>>>>>>>> the x86
>>>>>>>>>>>> ACPI NMI watchdog it seems like the former, but I have never 
>>>>>>>>>>>> worked with
>>>>>>>>>>>> x86 nor ACPI.
>>>>>>>>>> 
>>>>>>>>>> include/watchdog.h contains helper to configure the watchdog for 
>>>>>>>>>> Xen. We
>>>>>>>>>> also have per-VM watchdog and this is configured by the hypercall
>>>>>>>>>> SCHEDOP_watchdog.
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Currently forwarding it to Dom0 has not caused any watchdog resets 
>>>>>>>>>>>> with
>>>>>>>>>>>> our watchdog timeout settings, our specific Dom0 setup and VM 
>>>>>>>>>>>> count.
>>>>>>>>>> 
>>>>>>>>>> IIUC, the SMC API for the watchdog would be similar to the ACPI NMI
>>>>>>>>>> watchdog. So I think it would make more sense if this is not exposed 
>>>>>>>>>> to
>>>>>>>>>> dom0 (even if Xen is doing nothing with it).
>>>>>>>>>> 
>>>>>>>>>> Can you try to hide the SMCs and check if dom0 still behave properly?
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> 
>>>>>>>>> This SMC manages a hardware watchdog, if it's not pinged within a
>>>>>>>>> specific interval the entire board resets.
>>>>>>>> 
>>>>>>>> Do you know what's the default interval? Is it large enough so Xen + 
>>>>>>>> dom0 can boot (at least up to when the watchdog driver is initialized)?
>>>>>>>> 
>>>>>>>>> If I block the SMCs the watchdog driver in Dom0 will fail to ping the
>>>>>>>>> watchdog, triggering a board reset because the system looks to have
>>>>>>>>> become unresponsive. The reason this patch set started is because we
>>>>>>>>> couldn't ping the watchdog when running with Xen.
>>>>>>>>> In our specific system the bootloader enables the watchdog as early as
>>>>>>>>> possible so that we can get watchdog protection for as much of the 
>>>>>>>>> boot
>>>>>>>>> as we possibly can.
>>>>>>>>> So, if we are to block the SMC from Dom0, then Xen needs to take over
>>>>>>>>> the pinging. It could be implemented similarly to the NMI watchdog,
>>>>>>>>> except that the system will reset if the ping is missed rather than
>>>>>>>>> backtrace.
>>>>>>>>> It would also mean that Xen will get a whole watchdog driver-category
>>>>>>>>> due to the watchdog being vendor and sometimes even SoC specific when 
>>>>>>>>> it
>>>>>>>>> comes to Arm.
>>>>>>>>> My understanding of the domain watchdog code is that today the domain
>>>>>>>>> needs to call SCHEDOP_watchdog at least once to start the watchdog
>>>>>>>>> timer. Since watchdog protection through the whole boot process is
>>>>>>>>> desirable we'd need some core changes, such as an option to start the
>>>>>>>>> domain watchdog on init. >
>>>>>>>>> It's quite a big change to make
>>>>>>>> 
>>>>>>>> For clarification, above you seem to mention two changes:
>>>>>>>> 
>>>>>>>> 1) Allow Xen to use the HW watchdog
>>>>>>>> 2) Allow the domain to use the watchdog early
>>>>>>>> 
>>>>>>>> I am assuming by big change, you are referring to 2?
>>>>>>>> 
>>>>>>>> , while I am not against doing it if it
>>>>>>>>> makes sense, I now wonder if Xen should manage hardware watchdogs.
>>>>>>>>> Looking at the domain watchdog code it looks like if a domain does not
>>>>>>>>> get enough execution time, the watchdog will not be pinged enough and
>>>>>>>>> the guest will be reset. So either watchdog approach requires Dom0 to
>>>>>>>>> get execution time. Dom0 also needs to service all the PV backends 
>>>>>>>>> it's
>>>>>>>>> responsible for. I'm not sure it's valuable to add another layer of
>>>>>>>>> watchdog for this scenario as the end result (checking that the entire
>>>>>>>>> system works) is achieved without it as well.
>>>>>>>>> So, before I try to find the time to make a proposal for moving the
>>>>>>>>> hardware watchdog bit to Xen, do we really want it?
>>>>>>>> 
>>>>>>>> Thanks for the details. Given that the watchdog is enabled by the 
>>>>>>>> bootloader, I think we want Xen to drive the watchdog for two reasons:
>>>>>>>> 1) In true dom0less environment, dom0 would not exist
>>>>>>>> 2) You are relying on Xen + Dom0 to boot (or at least enough to get 
>>>>>>>> the watchdog working) within the watchdog interval.
>>>>>>> 
>>>>>>> Definitely we need to consider the case where there is no Dom0.
>>>>>>> 
>>>>>>> I think there are in fact 3 different use cases here:
>>>>>>> - watchdog fully driven in a domain (dom0 or another): would work if it 
>>>>>>> is expected
>>>>>>>    that Xen + Domain boot time is under the watchdog initial refresh 
>>>>>>> rate. I think this
>>>>>>>    could make sense on some applications where your system depends on a 
>>>>>>> specific
>>>>>>>    domain to be properly booted to work. This would require an initial 
>>>>>>> refresh time
>>>>>>>    configurable in the boot loader probably.
>>>>>> This is our use-case. ^
>>>>>> Our dom0 is monitoring and managing the other domains in our system.
>>>>>> Without dom0 working the system isn't really working as a whole.
>>>>>> @Julien: Would you be ok with the patch set continuing in the direction
>>>>>> of the
>>>>>> original proposal, letting another party (or me at a later time) 
>>>>>> implement
>>>>>> the fully driven by Xen option?
>>>>> I am concerned about this particular point from Bertrand:
>>>>> 
>>>>> "would work if it is expected that Xen + Domain boot time is under the 
>>>>> watchdog initial refresh rate."
>>>>> 
>>>>> How will a user be able to figure out how to initially configure the 
>>>>> watchdog? Is this something you can easily configure in the bootloader at 
>>>>> runtime?
>>> 
>>> If you go through the trouble of enabling the watchdog in the bootloader you
>>> usually know what you're doing and set the timeout yourself.
>>> 
>>> Since our systems can be mounted in really annoying locations (both in
>>> installations and world-wise) we need as much auto-recovery as possible to
>>> avoid people having to travel to collect a unit that just needed a reset due
>>> to some unexpected obscure issue at startup.
>> 
>> I definitely get the need do not get me wrong.
>> I am just concerned by potential users having target restarting when using 
>> Xen because of that and not knowing why.
>> 
>>>> 
>>>> Definitely here it would be better to have the watchdog turned off by 
>>>> default and document how to enable it in the firmware.
>>>> 
>>>> Even if a long timeout is configured by default, a user could run into 
>>>> trouble if using a linux without watchdog or not running linux or using 
>>>> dom0less without a linux having access to it.
>>>> I agree with Julien here that the concern could be that users would come 
>>>> to us instead of NXP if there is system is doing a reset without reasons 
>>>> after some seconds or minutes.
>>> 
>>> I could add myself as a reviewer for the iMX parts if that helps routing
>>> such
>>> questions (and future patches) also to me. We have experience with the QXP,
>>> and the QM (for the supported parts by this patch set) is identical.
>>> 
>>> Would that help with the concerns?
>> 
>> Definitely it could help.
> 
> I'll figure out how to include myself in the MAINTAINERS file for the 
> next spin.
> 

Thanks.

>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Overall, I am not for this approach at least in the current status. I 
>>>>> would be more inclined if there are some documentations explaining how 
>>>>> this is supposed to be configured on NXP, so others can use the code.
>>>>> 
>>>>> Anyway, this is why we have multiple Arm maintainers for this kind of 
>>>>> situation. If they other agrees with you, then they can ack the patch and 
>>>>> this can be merged.
>>>> 
>>>> I agree with Stefano that it would be good to have those board supported.
>>>> 
>>>> One thing i could suggest until there is a watchdog driver inside Xen is 
>>>> to have a clear Warning at Xen boot on those boards in the console so that 
>>>> we could at least identify the reason easily if a reset occurs for someone.
>>> 
>>> How do other SoCs deal with this currently? The iMX SoCs aren't the only
>>> ones
>>> with a watchdog, just the first one added to Xen that pings the watchdog
>>> over
>>> an SMC. What about the OMAPs, the R-Cars, Xilinx's, Exynos' and so on?
>> 
>> As far as I know the watchdog is usually not active until a driver activates 
>> it.
>> Which means that by default it will not fire.
>> Having it activated by the bootloader by default is not usual.
>> Now definitely on a lot of use cases the users are activating it in the 
>> booloader
>> but their systems are design for it.
>> 
>> Do I understand that the default boot loader configuration is not activating 
>> it on your side ?
> 
> We are using a bootloader called Punchboot [1] developed by one of our 
> employees. With Punchboot you have to explicitly write the code to 
> enable the watchdog yourself. In u-boot you need to enable the watchdog 
> drivers and configure the watchdog first before it is started. I don't 
> know how the situation is in other bootloaders such as BareBox.

Then a user not knowing will not have it activated and will not encounter 
issues.
So all good from my point of view :-)

Regards
Bertrand


> 
> Best regards // John Ernberg
> 
> [1]: https://github.com/jonasblixt/punchboot



Reply via email to