--- Begin Message ---
Hi Thomas

Thank you for the thorough explanation. That makes sense and we’ll test and 
reconsider.

Regarding the hardware watchdog: we are using recent Dell and Supermicro 
hardware with up-to-date firmware, and we are pretty sure the watchdog runs 
stable.
From past experience, there are hardware failures that a power cycle will 
“cure” (at least temporarily, until the hardware is replaced).
The softdog probably won’t work in this case.

> got some reports of failing HW watchdogs

I’d be interested to hear more about the circumstances (make, model, settings) 
from the community.
We are usually more interested in reliability (24/7/365) than performance.

> hope that helps,

It does, indeed :)

Thanks & cheers

Stefan


> On Dec 10, 2021, at 18:34, Thomas Lamprecht <[email protected]> wrote:
> 
> Hi,
> 
> On 10.12.21 15:22, Stefan Radman wrote:
>> What is the reason for hardcoding the watchdog timeout into 
>> pve-ha-manager/watchdog-mux.c?
> 
> Note that this is the multiplexer, the actual timeout for its clients is 60s.
> 
> The MUX opens the actual watchdog. It's a really small C program with a very
> small footprint and static resource usage, so it won't ever fail to update
> the watchdog in any situation where the system isn't totally lost.
> 
> The MUX then checks the actual clients. If those did not ping in the last
> 60s, the MUX will stop updating the actual watchdog, causing a reset around
> 0s to 10s later.
> 
> So the in-practice timeout for the watchdog services the MUX provides is 60
> to 70 seconds, not ten.
> 
>> 
>> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l33
>>  33 int watchdog_timeout = 10;
>> https://git.proxmox.com/?p=pve-ha-manager.git;a=blob;f=src/watchdog-mux.c#l157
>> 157     if (ioctl(watchdog_fd, WDIOC_SETTIMEOUT, &watchdog_timeout) == -1) {
>> 
>> I am trying to use a more conservative 5 minute timeout for the IPMI 
>> watchdog but it gets changed to 10 seconds when the watchdog-mux.service 
>> starts.
> 
> That's not a reasonable timeout for Proxmox VE's HA self-fencing, as pmxcfs
> locks have a timeout of 2 minutes. If you go above that, all consistency
> guarantees from the self-fencing are void and an HA service can be recovered
> while the original one still accesses some of its resources; IOW, there be
> dragons.
> 
> PS: Personally, I'd only rely on a HW watchdog if I'm really sure it runs
> stable. Most of the time their firmware is just a mess, and they have so many
> bugs that the softdog of the kernel, which itself is a quite small and simple
> kernel module, works more reliably. YMMV, but I never saw a situation where
> the softdog didn't do its job, but we got some reports of failing HW
> watchdogs - not /that/ many, but most users go for the default setup, so this
> may be biased.
> 
> hope that helps,
> Thomas
> 
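
For anyone following along, here is a rough sketch of the timing argument in
the quoted message above. It is not taken from watchdog-mux.c: the helper
names (mux_should_feed, worst_case_reset_s, fencing_is_safe) are hypothetical,
and only the three numbers (10 s hardware timeout, 60 s client timeout,
2 minute pmxcfs lock timeout) come from the thread. The behaviour at exactly
60 s is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Values quoted in the thread; everything else here is illustrative. */
enum {
    HW_TIMEOUT_S     = 10,   /* timeout the mux sets on the real watchdog */
    CLIENT_TIMEOUT_S = 60,   /* clients must ping the mux this often */
    LOCK_TIMEOUT_S   = 120,  /* pmxcfs lock timeout (2 minutes) */
};

/* Hypothetical helper: does the mux keep feeding the hardware watchdog,
 * given the seconds elapsed since the slowest client last pinged?
 * Treating exactly CLIENT_TIMEOUT_S as "expired" is an assumption. */
static bool mux_should_feed(int secs_since_last_client_ping)
{
    assert(secs_since_last_client_ping >= 0);
    return secs_since_last_client_ping < CLIENT_TIMEOUT_S;
}

/* Latest reset time, in seconds after the client's last ping: the mux
 * stops feeding once the client timeout passes, and the hardware then
 * fires at most one hardware timeout later. */
static int worst_case_reset_s(int client_timeout_s, int hw_timeout_s)
{
    return client_timeout_s + hw_timeout_s;
}

/* Self-fencing only protects consistency if the node is reset before
 * its locks can time out and be taken over by another node. */
static bool fencing_is_safe(int client_timeout_s, int hw_timeout_s,
                            int lock_timeout_s)
{
    return worst_case_reset_s(client_timeout_s, hw_timeout_s)
           < lock_timeout_s;
}
```

With the defaults, the worst case is 60 + 10 = 70 seconds, which is below the
120 second lock timeout. With a 300 second hardware timeout, the worst case
becomes 360 seconds, so fencing_is_safe(60, 300, 120) is false.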



--- End Message ---
_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
