On 07/07/2022 10:21, Fabian Grünbichler wrote:
> if processing a corosync.conf update is delayed on a single node,
> reloading the config too early can have disastrous results (loss of
> token and HA fence). artificially delay the reload command by one second
> to allow update propagation in most scenarios until a proper solution
> (e.g., using broadcasting/querying of locally deployed config versions)
> has been developed and fully tested.
>
> reported on the forum:
> https://forum.proxmox.com/threads/expanding-cluster-reboots-all-vms.110903/
>
> reported issue can be reproduced by deploying a patched pmxcfs on
> a non-reloading node that sleeps before writing out a broadcasted
> corosync.conf update and adding a node to the cluster, leading to the
> following sequence of events:
>
> - corosync config reload command received
> - corosync config update written out
>
> which causes that particular node to have a different view of cluster
> topology, causing all corosync communication to fail for all nodes until
> corosync on the affected node is restarted (the on-disk config is
> correct after all, just not in effect).
>
> Signed-off-by: Fabian Grünbichler <f.gruenbich...@proxmox.com>
> ---
> tested new cluster creation from scratch, and cluster expansion (on a
> test PVE cluster with HA enabled and running guests, to simulate some
> load).
>
> data/src/dcdb.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
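
for readers following along, the stop gap described above amounts to "wait a moment, then reload". a minimal sketch of that idea, not the code from the patch: the helper name reload_after_delay and its parameters are hypothetical, and the command string is supplied by the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from dcdb.c): wait delay_seconds so that
 * slower nodes can write out the broadcast corosync.conf, then run the
 * reload command. Returns whatever system() returns.
 */
static int reload_after_delay(unsigned int delay_seconds, const char *reload_cmd)
{
    assert(reload_cmd != NULL);

    /* sleep() returns the unslept remainder if a signal interrupts it,
     * so keep sleeping until the full delay has elapsed. */
    unsigned int left = delay_seconds;
    while (left > 0)
        left = sleep(left);

    /* Only now ask corosync to re-read its configuration. */
    return system(reload_cmd);
}
```

on a cluster node the call would presumably look like reload_after_delay(1, "corosync-cfgtool -R"). a fixed delay only makes the bad ordering less likely; it cannot rule it out if a node lags by more than the delay.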
applied, thanks! this is the simplest stop gap for now; any more elaborate mechanism may be better suited for a major release anyway, upgrade-wise.
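
as a rough illustration of what a more elaborate mechanism could check (the "querying of locally deployed config versions" idea from the commit message), here is a sketch under stated assumptions: each node somehow reports the config version it has written to disk, and the reload is only triggered once every node has caught up. the function name and the way versions are collected are hypothetical; nothing like this exists in pmxcfs as far as this thread shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: deployed[i] is the config version node i reports
 * having written out locally. Returns true only if every node has
 * deployed at least the wanted version, i.e. a reload would be safe.
 */
static bool all_nodes_deployed(const uint64_t *deployed, size_t n, uint64_t wanted)
{
    assert(deployed != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (deployed[i] < wanted)
            return false; /* node i still has the old config on disk */
    }
    return true;
}
```

the reloading node would poll or wait for reports until this returns true (with some timeout) instead of sleeping for a fixed second.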