Hi all, just confirming that since I added 'token: 4000' to my corosync.conf, my cluster has been working fine (4 days so far).
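For the archives, here's a sketch of what the totem section of /etc/pve/corosync.conf looks like with that change, modeled on Alexandre's example quoted below. The cluster name, config_version, and bind address are placeholders for my real values; only the 'token' line is new:

totem {
  cluster_name: mycluster
  # increase config_version by 1 before saving, as Alexandre notes below
  config_version: 12
  ip_version: ipv4
  version: 2
  # token timeout in milliseconds; the corosync default is 1000
  token: 4000

  interface {
    bindnetaddr: 10.0.0.0
    ringnumber: 0
  }
}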
Thanks again to everyone for helping me!

On Sat, Oct 29, 2016 at 2:22 PM, Szabolcs F. <[email protected]> wrote:
> Hi Alexandre,
>
> thanks so much for the tip about killing corosync and restarting the
> pve-cluster service. Previously I had tried to kill many different
> processes and also tried a clean pve-cluster restart (without killing
> processes), but none of these worked. Your tip worked: the cluster came
> back without having to power down all of my nodes.
>
> Now I'll try to change the corosync.conf values and see if that makes the
> cluster more stable.
>
> Thanks again!
>
> On Sat, Oct 29, 2016 at 9:26 AM, Alexandre DERUMIER <[email protected]> wrote:
>
>> Also, you can try to increase the token value in
>>
>> /etc/pve/corosync.conf
>>
>> Here's mine:
>>
>> totem {
>>   cluster_name: xxxxx
>>   config_version: 35
>>   ip_version: ipv4
>>   version: 2
>>   token: 4000
>>
>>   interface {
>>     bindnetaddr: X.X.X.X
>>     ringnumber: 0
>>   }
>> }
>>
>> (increase config_version by 1 before saving the file)
>>
>> Without the token value, I'm able to reproduce exactly your corosync
>> error.
>>
>> ----- Original Message -----
>> From: "aderumier" <[email protected]>
>> To: "proxmoxve" <[email protected]>
>> Sent: Saturday, 29 October 2016 09:20:19
>> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>>
>> What you can do is kill corosync on all nodes && then start it node by
>> node, to see when the problem begins to occur.
>>
>> If you want to restart corosync on all nodes, you can do:
>>
>> on each node
>> ------------
>> # killall -9 corosync
>>
>> then on each node
>> -----------------
>> # /etc/init.d/pve-cluster restart
>> (this will restart corosync && pmxcfs to mount /etc/pve)
>>
>> In the past, one "slow" node in my cluster (Opteron, 64 cores, 2.1 GHz)
>> among 15 faster nodes (Intel, 40 cores, 3.1 GHz) gave me this kind of
>> problem.
>>
>> You already have 12 nodes, so network latency could impact corosync
>> speed.
>>
>> I'm currently running a 16-node cluster with this latency:
>>
>> rtt min/avg/max/mdev = 0.050/0.070/0.079/0.010 ms
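(For anyone finding this thread later: the steps above are what brought my cluster back. Below is a rough sketch of the same sequence as a script; the node names and the SSH loop are illustrative only, as I actually ran the commands on each node by hand.)

# Sketch only: assumes passwordless SSH as root to every node.
# Replace the placeholder node list with your actual node names.
NODES="pve01 pve02 pve03"

# Step 1: kill corosync on all nodes first, as Alexandre suggests.
for n in $NODES; do
    ssh root@"$n" 'killall -9 corosync'
done

# Step 2: restart pve-cluster node by node; this restarts corosync and
# pmxcfs, which remounts /etc/pve. Starting one node at a time also shows
# on which node the problem begins to occur.
for n in $NODES; do
    ssh root@"$n" '/etc/init.d/pve-cluster restart'
done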
>> ----- Original Message -----
>> From: "Szabolcs F." <[email protected]>
>> To: "proxmoxve" <[email protected]>
>> Sent: Friday, 28 October 2016 17:30:49
>> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>>
>> Hi Alexandre,
>>
>> please find my logs below, from three different nodes, just to see if
>> there's any difference:
>>
>> pve01 node: http://pastebin.com/M14R0WBc
>> pve02 node: http://pastebin.com/q1kW07xs
>> pve09 node (totem): http://pastebin.com/CpZd6dmn
>>
>> omping gives me similar results on all nodes:
>> http://pastebin.com/s4H92Scg
>>
>> Thanks!
>>
>> On Fri, Oct 28, 2016 at 3:55 PM, Alexandre DERUMIER <[email protected]> wrote:
>>
>>> Can you send your corosync log from /var/log/daemon.log?
>>>
>>> ----- Original Message -----
>>> From: "Szabolcs F." <[email protected]>
>>> To: "Michael Rasmussen" <[email protected]>
>>> Cc: "proxmoxve" <[email protected]>
>>> Sent: Friday, 28 October 2016 15:40:06
>>> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>>>
>>> Hi all,
>>>
>>> my issue came back, so it wasn't related to having Proxmox 4.2 on 4
>>> nodes and Proxmox 4.3 on the other 8 nodes.
>>>
>>> Now, for example, if I log into the web UI of my first node, all 11
>>> other nodes are marked with a red cross. But if I click on a node I can
>>> still see the summary (uptime, load, etc.) and can still get a shell on
>>> other nodes. However, I can't see the name/status of virtual machines
>>> running on the red-crossed nodes (I can only see the VM ID/number), and
>>> of course I can't migrate any VM from one host to another.
>>>
>>> Any ideas?
>>>
>>> Thanks!
>>>
>>> On Wed, Oct 26, 2016 at 12:57 PM, Szabolcs F. <[email protected]> wrote:
>>>
>>>> Hello again,
>>>>
>>>> sorry for another follow-up. I just realised that 4 of the 12 cluster
>>>> nodes still have PVE Manager version 4.2 and the other 8 nodes have
>>>> version 4.3. Could this be the reason for all my troubles?
>>>>
>>>> I'm in the process of updating these 4 nodes. They were installed with
>>>> the Proxmox install media, while the other 8 nodes were installed with
>>>> Debian 8 first, so the 4 outdated nodes didn't have the
>>>> 'deb http://download.proxmox.com/debian jessie pve-no-subscription'
>>>> repo file. Adding this repo made the 4.3 updates available.
>>>>
>>>> On Wed, Oct 26, 2016 at 12:20 PM, Szabolcs F. <[email protected]> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> I can change to LACP, sure. Would it be better than simple
>>>>> active-backup? I don't have much experience with LACP, though.
>>>>>
>>>>> On Wed, Oct 26, 2016 at 11:55 AM, Michael Rasmussen <[email protected]> wrote:
>>>>>
>>>>>> Is it possible to switch to 802.3ad bond mode?
>>>>>>
>>>>>> On October 26, 2016 11:12:06 AM GMT+02:00, "Szabolcs F." <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Lutz,
>>>>>>>
>>>>>>> my bondXX files look like this: http://pastebin.com/GX8x3ZaN
>>>>>>> and my corosync.conf: http://pastebin.com/2ss0AAEr
>>>>>>>
>>>>>>> Multicast is enabled on my switches.
>>>>>>>
>>>>>>> The problem is that I don't have a way to replicate the issue; it
>>>>>>> seems to happen randomly, so I'm unsure how to do more tests. At the
>>>>>>> moment my cluster has been working fine for about 16 hours. Any
>>>>>>> ideas on how to force the issue?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Szabolcs
>>>>>>>
>>>>>>> On Wed, Oct 26, 2016 at 9:17 AM, Lutz Willek <[email protected]> wrote:
>>>>>>>
>>>>>>>> On 24.10.2016 at 15:16, Szabolcs F. wrote:
>>>>>>>>
>>>>>>>>> Corosync has a lot of these in /var/log/daemon.log:
>>>>>>>>> http://pastebin.com/ajhE8Rb9
>>>>>>>>
>>>>>>>> Please carefully check your (node/switch/multicast) network
>>>>>>>> configuration, and please paste your corosync configuration file
>>>>>>>> and the output of /proc/net/bonding/bondXX.
>>>>>>>>
>>>>>>>> Just a guess:
>>>>>>>>
>>>>>>>> * power down 1/3 - 1/2 of your nodes, adjust quorum (pvecm expected)
>>>>>>>>   --> do the problems still occur?
>>>>>>>>
>>>>>>>> * during "problem time"
>>>>>>>>   --> is omping still OK?
>>>>>>>>
>>>>>>>> https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues
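(For reference, the omping test behind my pastebin results further up in the thread was along these lines; the wiki page above describes the same procedure. Node names are placeholders, and the same command has to run on all nodes at the same time:)

# Run simultaneously on every cluster node (names are examples).
# Sends 600 multicast+unicast pings at 1-second intervals (~10 minutes)
# and prints a packet-loss/latency summary for each peer at the end.
omping -c 600 -i 1 -q pve01 pve02 pve09

(Multicast loss here with clean unicast usually points at the switch, e.g. IGMP snooping enabled without an IGMP querier, rather than at corosync itself.)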
>>>>>>>> Freundliche Grüße / Best Regards
>>>>>>>>
>>>>>>>> Lutz Willek
>>>>>>>>
>>>>>>>> --
>>>>>>>> creating IT solutions
>>>>>>>> Lutz Willek, Senior Systems Engineer, IT Services Berlin
>>>>>>>> science + computing ag, Geschäftsstelle Berlin
>>>>>>>> Friedrichstraße 187, 10117 Berlin, Germany
>>>>>>>> phone +49(0)30 2007697-21, fax +49(0)30 2007697-11
>>>>>>>> http://de.atos.net/sc
>>>>>>>>
>>>>>>>> S/MIME security:
>>>>>>>> http://www.science-computing.de/cacert.crt
>>>>>>>> http://www.science-computing.de/cacert-sha512.crt
>>>>>>
>>>>>> --
>>>>>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
