Hi Alexandre, thanks so much for the tip about killing corosync and restarting the pve-cluster service. I had previously tried killing various other processes, and also a clean pve-cluster restart (without killing anything first), but none of that worked. Your tip did: the cluster came back without me having to power down all of my nodes.

Now I'll try changing the corosync.conf values and see whether that makes the cluster more stable. Thanks again!
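For anyone finding this thread in the archives, the change I'm planning looks roughly like this; the token value of 4000 is taken straight from Alexandre's example below and is a starting point rather than a tuned value:

# /etc/pve/corosync.conf is replicated by pmxcfs, so edit it on one node only:
nano /etc/pve/corosync.conf
#   - add "token: 4000" inside the totem { } section
#   - increment config_version by 1 (as I understand it, the edited file is
#     only propagated to the nodes when config_version increases)
# then apply the change by restarting corosync, one node at a time:
systemctl restart corosync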
On Sat, Oct 29, 2016 at 9:26 AM, Alexandre DERUMIER <[email protected]> wrote:

> Also, you can try to increase the token value in
>
> /etc/pve/corosync.conf
>
> Here's mine:
>
> totem {
>   cluster_name: xxxxx
>   config_version: 35
>   ip_version: ipv4
>   version: 2
>   token: 4000
>
>   interface {
>     bindnetaddr: X.X.X.X
>     ringnumber: 0
>   }
> }
>
> (increase config_version by 1 before saving the file)
>
> Without a token value, I'm able to reproduce exactly your corosync error.
>
> ----- Original message -----
> From: "aderumier" <[email protected]>
> To: "proxmoxve" <[email protected]>
> Sent: Saturday, 29 October 2016 09:20:19
> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>
> What you can do is kill corosync on all nodes and then start it node by node, to see when the problem begins to occur.
>
> If you want to restart corosync on all nodes, you can do:
>
> on each node
> ------------
> # killall -9 corosync
>
> then on each node
> -----------------
> /etc/init.d/pve-cluster restart   (this restarts corosync && pmxcfs to mount /etc/pve)
>
> In the past, I found that one "slow" node (Opteron, 64 cores, 2.1 GHz) in a cluster of 15 faster nodes (Intel, 40 cores, 3.1 GHz) gave me this kind of problem.
>
> You already have 12 nodes, so network latency could impact corosync speed. I'm currently running a 16-node cluster with this latency:
>
> rtt min/avg/max/mdev = 0.050/0.070/0.079/0.010 ms
>
> ----- Original message -----
> From: "Szabolcs F." <[email protected]>
> To: "proxmoxve" <[email protected]>
> Sent: Friday, 28 October 2016 17:30:49
> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>
> Hi Alexandre,
>
> please find my logs below, from three different nodes, just to see if there's any difference:
>
> pve01 node: http://pastebin.com/M14R0WBc
> pve02 node: http://pastebin.com/q1kW07xs
> pve09 node (totem): http://pastebin.com/CpZd6dmn
>
> omping gives me similar results on all nodes: http://pastebin.com/s4H92Scg
>
> Thanks!
>
> On Fri, Oct 28, 2016 at 3:55 PM, Alexandre DERUMIER <[email protected]> wrote:
>
>> can you send your corosync log in /var/log/daemon.log?
>>
>> ----- Original message -----
>> From: "Szabolcs F." <[email protected]>
>> To: "Michael Rasmussen" <[email protected]>
>> Cc: "proxmoxve" <[email protected]>
>> Sent: Friday, 28 October 2016 15:40:06
>> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>>
>> Hi All,
>>
>> my issue came back, so it wasn't related to having Proxmox 4.2 on 4 nodes and Proxmox 4.3 on the other 8 nodes.
>>
>> Now, for example, if I log into the web UI of my first node, all 11 other nodes are marked with the red cross. If I click on a node I can still see its summary (uptime, load, etc.) and can still get a shell on it, but I can't see the name/status of the virtual machines running on the red-crossed nodes (I can only see the VM ID/number), and of course I can't migrate any VM from one host to another.
>>
>> Any ideas?
>>
>> Thanks!
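(A note for the archive on diagnosing this state: when the nodes went red like that but shells still worked, a few quick checks on an affected node were enough to tell that corosync/pmxcfs had lost quorum and the network itself was fine. pvecm, corosync-quorumtool and the systemd units are standard PVE tools; treating "Quorate: No" as the smoking gun is my own reading:)

pvecm status                            # does this node consider the cluster quorate?
corosync-quorumtool -s                  # the same question, asked of corosync directly
systemctl status pve-cluster corosync   # is pmxcfs (pve-cluster) still running?
# If corosync is wedged, recovery is the "killall -9 corosync" plus
# "pve-cluster restart" sequence Alexandre describes above.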
>> On Wed, Oct 26, 2016 at 12:57 PM, Szabolcs F. <[email protected]> wrote:
>>
>>> Hello again,
>>>
>>> sorry for another follow-up. I just realised that 4 of the 12 cluster nodes still have PVE Manager version 4.2 and the other 8 nodes have version 4.3. Could this be the reason for all my troubles?
>>>
>>> I'm in the process of updating these 4 nodes. They were installed from the Proxmox install media, while the other 8 nodes were installed on top of Debian 8, so the 4 outdated nodes didn't have the 'deb http://download.proxmox.com/debian jessie pve-no-subscription' repo file. Adding this repo made the 4.3 updates available.
>>>
>>> On Wed, Oct 26, 2016 at 12:20 PM, Szabolcs F. <[email protected]> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> I can change to LACP, sure. Would it be better than simple active-backup? I haven't got much experience with LACP, though.
>>>>
>>>> On Wed, Oct 26, 2016 at 11:55 AM, Michael Rasmussen <[email protected]> wrote:
>>>>
>>>>> Is it possible to switch to 802.3ad bond mode?
>>>>>
>>>>> On October 26, 2016 11:12:06 AM GMT+02:00, "Szabolcs F." <[email protected]> wrote:
>>>>>
>>>>>> Hi Lutz,
>>>>>>
>>>>>> my bondXX files look like this: http://pastebin.com/GX8x3ZaN
>>>>>> and my corosync.conf: http://pastebin.com/2ss0AAEr
>>>>>>
>>>>>> Multicast is enabled on my switches.
>>>>>>
>>>>>> The problem is that I don't have a way to replicate the issue; it seems to happen randomly, so I'm unsure how to do more tests. At the moment my cluster has been working fine for about 16 hours. Any ideas for forcing the issue?
>>>>>>
>>>>>> Thanks,
>>>>>> Szabolcs
>>>>>>
>>>>>> On Wed, Oct 26, 2016 at 9:17 AM, Lutz Willek <[email protected]> wrote:
>>>>>>
>>>>>>> On 24.10.2016 at 15:16, Szabolcs F. wrote:
>>>>>>>
>>>>>>>> Corosync has a lot of these in /var/log/daemon.log: http://pastebin.com/ajhE8Rb9
>>>>>>>
>>>>>>> please carefully check your (node/switch/multicast) network configuration, and please paste your corosync configuration file and the output of /proc/net/bonding/bondXX.
>>>>>>>
>>>>>>> Just a guess:
>>>>>>>
>>>>>>> * power down 1/3 to 1/2 of your nodes and adjust the quorum (pvecm expected)
>>>>>>>   --> do the problems still occur?
>>>>>>>
>>>>>>> * during "problem time"
>>>>>>>   --> is omping still OK?
>>>>>>>
>>>>>>> https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues
>>>>>>>
>>>>>>> Freundliche Grüße / Best Regards
>>>>>>>
>>>>>>> Lutz Willek
>>>>>>> science + computing ag, IT Services Berlin
>>>>>
>>>>> --
>>>>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
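(Another note for the archive: the multicast and quorum tests Lutz refers to, roughly as I ran them. The node names are mine, and the omping invocation follows the Proxmox troubleshooting wiki linked above:)

# Run the same omping command on every node at the same time, listing all
# cluster nodes (only the first three of my twelve shown here):
omping -c 600 -i 1 -q pve01 pve02 pve03
# While part of the cluster is powered down, lower the expected vote count so
# the remaining nodes stay quorate (e.g. 6 votes while half of 12 nodes are off):
pvecm expected 6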
_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
