On 07/07/2022 10:21, Fabian Grünbichler wrote:
> if processing a corosync.conf update is delayed on a single node,
> reloading the config too early can have disastrous results (loss of
> token and HA fence). artificially delay the reload command by one second
> to allow update propagation in most scenarios until a proper solution
> (e.g., using broadcasting/querying of locally deployed config versions)
> has been developed and fully tested.
> 
> reported on the forum:
> https://forum.proxmox.com/threads/expanding-cluster-reboots-all-vms.110903/
> 
> the reported issue can be reproduced by deploying, on a non-reloading
> node, a patched pmxcfs that sleeps before writing out a broadcast
> corosync.conf update, and then adding a node to the cluster. this leads
> to the following sequence of events on that node:
> 
> - corosync config reload command received
> - corosync config update written out
> 
> which causes that particular node to have a different view of cluster
> topology, causing all corosync communication to fail for all nodes until
> corosync on the affected node is restarted (the on-disk config is
> correct after all, just not in effect).
> 
> Signed-off-by: Fabian Grünbichler <f.gruenbich...@proxmox.com>
> ---
> tested new cluster creation from scratch, and cluster expansion (on a
> test PVE cluster with HA enabled and running guests, to simulate some
> load).
> 
>  data/src/dcdb.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
>

applied, thanks!

for now this is the simplest stopgap; any more elaborate mechanism may be
better suited for a major release anyway, upgrade-wise.
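
for readers of the archive, a minimal sketch of the "delay, then reload"
idea described in the commit message. this is not the applied diff to
data/src/dcdb.c: the names reload_after_delay, reload_fn and fake_reload
are hypothetical, and the real reload is replaced by a stand-in callback
so the sketch runs without corosync installed.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical callback type: performs the reload, returns 0 on success. */
typedef int (*reload_fn)(void);

/* Counts how often the stand-in reload below was invoked. */
static int reload_calls = 0;

/*
 * Wait delay_sec seconds so that other nodes have a chance to write out
 * the updated config, then trigger the reload via the callback.
 * This only narrows the race window; it does not verify that every node
 * has actually written the new config.
 */
static int reload_after_delay(unsigned int delay_sec, reload_fn do_reload)
{
    assert(do_reload != NULL);

    /* sleep() returns the unslept seconds if interrupted by a signal,
     * so keep sleeping until the full delay has elapsed. */
    unsigned int left = delay_sec;
    while (left > 0)
        left = sleep(left);

    return do_reload();
}

/* Stand-in for the real reload command (assumption, illustration only). */
static int fake_reload(void)
{
    reload_calls++;
    puts("reloading config");
    return 0;
}
```

calling reload_after_delay(1, fake_reload) mirrors the one-second delay
from the patch description. a fixed delay is a heuristic: a node that
takes longer than the delay to write out the update can still hit the
sequence described above, which is why querying the deployed config
versions is mentioned as the proper long-term solution.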


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
