>> Did you get in contact with knet/corosync devs about this?
>> Because, it may well be something their stack is better at handling it, maybe
>> there's also really still a bug, or bad behaviour on some edge cases...
Not yet. I would like to have more info to submit, because for now I'm blind. I have enabled debug logs on all my cluster nodes in case it happens again.

BTW, I have noticed something: corosync is stopped after syslog stops, so at shutdown we never get corosync logs. I have edited corosync.service:

- After=network-online.target
+ After=network-online.target syslog.target

and now it's logging correctly.

Now that logging works, I'm also seeing pmxcfs errors when corosync is stopping (but no pmxcfs shutdown log). Do you think it would be possible to shut down pmxcfs cleanly first, before stopping corosync?

"
Sep 14 17:23:49 pve corosync[1346]: [MAIN ] Node was shut down by a signal
Sep 14 17:23:49 pve systemd[1]: Stopping Corosync Cluster Engine...
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Unloading all Corosync service engines.
Sep 14 17:23:49 pve corosync[1346]: [QB ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: cmap_dispatch failed: 2
Sep 14 17:23:49 pve corosync[1346]: [QB ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync configuration map access
Sep 14 17:23:49 pve corosync[1346]: [QB ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync configuration service
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_leave failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_leave failed: 2
Sep 14 17:23:49 pve corosync[1346]: [QB ] withdrawing server sockets
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: quorum_dispatch failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] notice: node lost quorum
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync profile loading service
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync resource monitoring service
Sep 14 17:23:49 pve corosync[1346]: [SERV ] Service engine unloaded: corosync watchdog service
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: quorum_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [quorum] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: cmap_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [confdb] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] notice: start cluster connection
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: cpg_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [dcdb] crit: can't initialize service
Sep 14 17:23:49 pve pmxcfs[1132]: [status] notice: start cluster connection
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: cpg_initialize failed: 2
Sep 14 17:23:49 pve pmxcfs[1132]: [status] crit: can't initialize service
Sep 14 17:23:50 pve corosync[1346]: [MAIN ] Corosync Cluster Engine exiting normally
"

----- Original Message -----
From: "Thomas Lamprecht" <t.lampre...@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderum...@odiso.com>, "dietmar" <diet...@proxmox.com>
Sent: Monday, September 14, 2020 10:51:03
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/14/20 10:27 AM, Alexandre DERUMIER wrote:
>> I wonder if something like pacemaker sbd could be implemented in proxmox as
>> an extra layer of protection?
>
>>> AFAIK Thomas already has patches to implement active fencing.
>
>>> But IMHO this will not solve the corosync problems..
>
> Yes, sure. I would really like to have 2 different sources of verification, with
> different paths/software, to avoid this kind of bug.
> (shit happens, Murphy's law ;)

You would then need at least three, and if one has a bug flooding the network, then in a lot of setups (not having beefy switches like you ;) the other two will be taken down also, as either the memory or the system stack gets overloaded.

> as we say in French, "ceinture & bretelles" -> "belt and braces"
>
> BTW,
> a user has reported a new corosync problem here:
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871
>
> (Sounds like the bug I had 6 months ago, with corosync flooding a lot
> of udp packets, but not the same bug I have here)

Did you get in contact with knet/corosync devs about this?
Because, it may well be something their stack is better at handling it, maybe
there's also really still a bug, or bad behaviour on some edge cases...

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
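P.S., on the syslog ordering change discussed above: editing the shipped corosync.service directly gets reverted on the next package upgrade, so the same effect could be had with a systemd drop-in. A minimal sketch of that idea (untested here; the drop-in path is the one `systemctl edit corosync.service` would create):

```ini
# /etc/systemd/system/corosync.service.d/override.conf
#
# After= is additive across drop-ins, so only the extra ordering target
# needs to be listed here. Since systemd stops units in the reverse of
# their start ordering, making corosync start after syslog.target means
# syslog is still running while corosync shuts down, so the shutdown
# messages are no longer lost.
[Unit]
After=syslog.target
```

After creating the drop-in, a `systemctl daemon-reload` is needed for it to take effect. In principle the same mechanism could express a "stop pmxcfs before corosync" ordering, but that would also reverse their start order, which may have other side effects, so that part seems better left to a proper fix in the packaged units.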