--- Begin Message ---
Hi Thomas
Thank you for the thorough explanation. That makes sense and we’ll test and
reconsider.
Regarding the hardware watchdog: we are using recent Dell and Supermicro
hardware with up-to-date firmware, and we are pretty sure the watchdog runs
stable.
From past experience, there are hardware failures that a power cycle will
“cure” (at least temporarily, until the hardware is replaced).
The softdog probably won’t work in this case.
> got some report of failing HW watchdogs
I’d be interested to hear more about the circumstances (make, model, settings)
from the community.
We are usually more interested in reliability (24/7/365) than performance.
> hope that helps,
It does, indeed :)
Thanks & cheers
Stefan
> On Dec 10, 2021, at 18:34, Thomas Lamprecht <[email protected]> wrote:
>
> Hi,
>
> On 10.12.21 15:22, Stefan Radman wrote:
>> What is the reason for hardcoding the watchdog timeout into
>> pve-ha-manager/watchdog-mux.c?
>
> Note that this is the multiplexer, the actual timeout for its clients is 60s.
>
> The MUX opens the actual watchdog, it's a really small C program with a very
> small footprint and static resource usage, so it won't ever fail to update the
> watchdog in any situation where the system isn't totally lost.
>
> The MUX then checks the actual clients, if those did not ping in the last 60s
> the MUX will stop updating the actual watchdog, causing a reset around 0s to
> 10s later.
>
> So the in-practice timeout for the watchdog services the MUX provides is 60 to
> 70 seconds, not ten.
>
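To make the arithmetic above concrete, here is a minimal sketch of the two-stage timeout as described. It is not the actual watchdog-mux.c code; the function names are hypothetical, and only the 10s and 60s values come from this thread.

```c
#include <stdbool.h>

/* Values quoted in the thread. */
#define HW_TIMEOUT_S     10  /* timeout the MUX sets on the real watchdog */
#define CLIENT_TIMEOUT_S 60  /* timeout the MUX enforces on its clients   */

/* Hypothetical helper: the MUX keeps updating the hardware watchdog only
 * while its client has pinged within the last CLIENT_TIMEOUT_S seconds. */
bool mux_should_feed(long now_s, long last_client_ping_s)
{
    return (now_s - last_client_ping_s) < CLIENT_TIMEOUT_S;
}

/* Earliest reset, in seconds after the client's last ping: the MUX stops
 * feeding right as the hardware timeout is about to expire. */
int earliest_reset_s(void)
{
    return CLIENT_TIMEOUT_S;
}

/* Latest reset: the MUX fed the hardware watchdog just before it noticed
 * the client was gone, so the full hardware timeout still has to elapse. */
int latest_reset_s(void)
{
    return CLIENT_TIMEOUT_S + HW_TIMEOUT_S;
}
```

With these values the reset window is 60 to 70 seconds after the last client ping, matching the numbers given above.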
>>
>> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l33
>>     int watchdog_timeout = 10;
>>
>> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l157
>>     if (ioctl(watchdog_fd, WDIOC_SETTIMEOUT, &watchdog_timeout) == -1) {
>>
>> I am trying to use a more conservative 5 minute timeout for the IPMI
>> watchdog but it gets changed to 10 seconds when the watchdog-mux.service
>> starts.
>
> That's not a reasonable timeout for Proxmox VE's HA self fencing as pmxcfs
> locks have a timeout of 2 minutes, if you go above that all consistency
> guarantees from the self fencing are void and a HA Service can be recovered
> while the original one still accesses some of its resources, iow. there be
> dragons.
>
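The constraint above can be written as a simple budget check. This is a rough sketch under one assumption of mine: that the worst-case fence delay is the MUX client timeout plus the hardware watchdog timeout, and that this sum has to stay below the pmxcfs lock timeout. The function names are hypothetical; only the 60s and 2 minute figures are from this thread.

```c
#include <stdbool.h>

/* Figures quoted in the thread. */
enum {
    PMXCFS_LOCK_TIMEOUT_S = 120, /* pmxcfs lock timeout: 2 minutes */
    MUX_CLIENT_TIMEOUT_S  = 60   /* MUX client timeout             */
};

/* Worst-case delay between a client's last ping and the reset: the MUX
 * waits out the client timeout, then the hardware watchdog needs up to
 * hw_timeout_s more seconds to fire. */
int worst_case_fence_s(int hw_timeout_s)
{
    return MUX_CLIENT_TIMEOUT_S + hw_timeout_s;
}

/* Assumption for illustration: self fencing is only safe if the node is
 * guaranteed to reset before its lock can time out. */
bool fence_beats_lock(int hw_timeout_s)
{
    return worst_case_fence_s(hw_timeout_s) < PMXCFS_LOCK_TIMEOUT_S;
}
```

Under that assumption the hardcoded 10s gives a worst case of 70s, inside the 120s budget, while a 5 minute (300s) hardware timeout gives 360s, well past it.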
> ps. Personally I'd only rely on a HW watchdog if I'm really sure it runs
> stable, most of the time their firmware is just a mess and they have so many
> bugs that the softdog of the kernel, which itself is a quite small and simple
> kernel module, works more stable. YMMV, but I never saw a situation where the
> softdog didn't do its job but we got some report of failing HW watchdogs - not
> /that/ many, but most users go for the default setup so this may be biased.
>
> hope that helps,
> Thomas
>
--- End Message ---
_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user