Hi all, just confirming that since I added 'token: 4000' to my corosync.conf, my cluster has been working fine (4 days so far).
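For the archives, here's a sketch of what the totem section of /etc/pve/corosync.conf looks like with that change, modeled on Alexandre's example quoted below. The cluster name, config_version, and bind address are placeholders for my real values; only the 'token' line is new:

totem {
  cluster_name: mycluster
  # increase config_version by 1 before saving, as Alexandre notes below
  config_version: 12
  ip_version: ipv4
  version: 2
  # token timeout in milliseconds; the corosync default is 1000
  token: 4000

  interface {
    bindnetaddr: 10.0.0.0
    ringnumber: 0
  }
}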
Thanks again to everyone for helping me!

On Sat, Oct 29, 2016 at 2:22 PM, Szabolcs F. <[email protected]> wrote:
> Hi Alexandre,
>
> thanks so much for the tip about killing corosync and restarting the
> pve-cluster service. Previously I had tried to kill many different
> processes and also tried a clean pve-cluster restart (without killing
> processes), but none of these worked. Your tip worked: the cluster came
> back without having to power down all of my nodes.
>
> Now I'll try to change the corosync.conf values and see if that makes the
> cluster more stable.
>
> Thanks again!
>
> On Sat, Oct 29, 2016 at 9:26 AM, Alexandre DERUMIER <[email protected]> wrote:
>
>> Also, you can try to increase the token value in
>>
>> /etc/pve/corosync.conf
>>
>> Here's mine:
>>
>> totem {
>>   cluster_name: xxxxx
>>   config_version: 35
>>   ip_version: ipv4
>>   version: 2
>>   token: 4000
>>
>>   interface {
>>     bindnetaddr: X.X.X.X
>>     ringnumber: 0
>>   }
>> }
>>
>> (increase config_version by 1 before saving the file)
>>
>> Without the token value, I'm able to reproduce exactly your corosync
>> error.
>>
>> ----- Original Message -----
>> From: "aderumier" <[email protected]>
>> To: "proxmoxve" <[email protected]>
>> Sent: Saturday, 29 October 2016 09:20:19
>> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>>
>> What you can do is kill corosync on all nodes && then start it node by
>> node, to see when the problem begins to occur.
>>
>> If you want to restart corosync on all nodes, you can do:
>>
>> on each node
>> ------------
>> # killall -9 corosync
>>
>> then on each node
>> -----------------
>> # /etc/init.d/pve-cluster restart
>> (this will restart corosync && pmxcfs to mount /etc/pve)
>>
>> In the past, one "slow" node in my cluster (Opteron, 64 cores, 2.1 GHz)
>> among 15 faster nodes (Intel, 40 cores, 3.1 GHz) gave me this kind of
>> problem.
>>
>> You already have 12 nodes, so network latency could impact corosync
>> speed.
>>
>> I'm currently running a 16-node cluster with this latency:
>>
>> rtt min/avg/max/mdev = 0.050/0.070/0.079/0.010 ms
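(For anyone finding this thread later: the steps above are what brought my cluster back. Below is a rough sketch of the same sequence as a script; the node names and the SSH loop are illustrative only, as I actually ran the commands on each node by hand.)

# Sketch only: assumes passwordless SSH as root to every node.
# Replace the placeholder node list with your actual node names.
NODES="pve01 pve02 pve03"

# Step 1: kill corosync on all nodes first, as Alexandre suggests.
for n in $NODES; do
    ssh root@"$n" 'killall -9 corosync'
done

# Step 2: restart pve-cluster node by node; this restarts corosync and
# pmxcfs, which remounts /etc/pve. Starting one node at a time also shows
# on which node the problem begins to occur.
for n in $NODES; do
    ssh root@"$n" '/etc/init.d/pve-cluster restart'
done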
>> ----- Original Message -----
>> From: "Szabolcs F." <[email protected]>
>> To: "proxmoxve" <[email protected]>
>> Sent: Friday, 28 October 2016 17:30:49
>> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>>
>> Hi Alexandre,
>>
>> please find my logs below, from three different nodes, just to see if
>> there's any difference:
>>
>> pve01 node: http://pastebin.com/M14R0WBc
>> pve02 node: http://pastebin.com/q1kW07xs
>> pve09 node (totem): http://pastebin.com/CpZd6dmn
>>
>> omping gives me similar results on all nodes:
>> http://pastebin.com/s4H92Scg
>>
>> Thanks!
>>
>> On Fri, Oct 28, 2016 at 3:55 PM, Alexandre DERUMIER <[email protected]> wrote:
>>
>>> Can you send your corosync log from /var/log/daemon.log?
>>>
>>> ----- Original Message -----
>>> From: "Szabolcs F." <[email protected]>
>>> To: "Michael Rasmussen" <[email protected]>
>>> Cc: "proxmoxve" <[email protected]>
>>> Sent: Friday, 28 October 2016 15:40:06
>>> Subject: Re: [PVE-User] Promox 4.3 cluster issue
>>>
>>> Hi all,
>>>
>>> my issue came back, so it wasn't related to having Proxmox 4.2 on 4
>>> nodes and Proxmox 4.3 on the other 8 nodes.
>>>
>>> Now, for example, if I log into the web UI of my first node, all 11
>>> other nodes are marked with a red cross. But if I click on a node I can
>>> still see the summary (uptime, load, etc.) and can still get a shell on
>>> other nodes. However, I can't see the name/status of virtual machines
>>> running on the red-crossed nodes (I can only see the VM ID/number), and
>>> of course I can't migrate any VM from one host to another.
>>>
>>> Any ideas?
>>>
>>> Thanks!
>>>
>>> On Wed, Oct 26, 2016 at 12:57 PM, Szabolcs F. <[email protected]> wrote:
>>>
>>>> Hello again,
>>>>
>>>> sorry for another follow-up. I just realised that 4 of the 12 cluster
>>>> nodes still have PVE Manager version 4.2 and the other 8 nodes have
>>>> version 4.3. Could this be the reason for all my troubles?
>>>>
>>>> I'm in the process of updating these 4 nodes. They were installed with
>>>> the Proxmox install media, while the other 8 nodes were installed with
>>>> Debian 8 first, so the 4 outdated nodes didn't have the
>>>> 'deb http://download.proxmox.com/debian jessie pve-no-subscription'
>>>> repo file. Adding this repo made the 4.3 updates available.
>>>>
>>>> On Wed, Oct 26, 2016 at 12:20 PM, Szabolcs F. <[email protected]> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> I can change to LACP, sure. Would it be better than simple
>>>>> active-backup? I don't have much experience with LACP, though.
>>>>>
>>>>> On Wed, Oct 26, 2016 at 11:55 AM, Michael Rasmussen <[email protected]> wrote:
>>>>>
>>>>>> Is it possible to switch to 802.3ad bond mode?
>>>>>>
>>>>>> On October 26, 2016 11:12:06 AM GMT+02:00, "Szabolcs F." <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Lutz,
>>>>>>>
>>>>>>> my bondXX files look like this: http://pastebin.com/GX8x3ZaN
>>>>>>> and my corosync.conf: http://pastebin.com/2ss0AAEr
>>>>>>>
>>>>>>> Multicast is enabled on my switches.
>>>>>>>
>>>>>>> The problem is that I don't have a way to replicate the issue; it
>>>>>>> seems to happen randomly, so I'm unsure how to do more tests. At the
>>>>>>> moment my cluster has been working fine for about 16 hours. Any
>>>>>>> ideas on how to force the issue?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Szabolcs
>>>>>>>
>>>>>>> On Wed, Oct 26, 2016 at 9:17 AM, Lutz Willek <[email protected]> wrote:
>>>>>>>
>>>>>>>> On 24.10.2016 at 15:16, Szabolcs F. wrote:
>>>>>>>>
>>>>>>>>> Corosync has a lot of these in /var/log/daemon.log:
>>>>>>>>> http://pastebin.com/ajhE8Rb9
>>>>>>>>
>>>>>>>> Please carefully check your (node/switch/multicast) network
>>>>>>>> configuration, and please paste your corosync configuration file
>>>>>>>> and the output of /proc/net/bonding/bondXX.
>>>>>>>>
>>>>>>>> Just a guess:
>>>>>>>>
>>>>>>>> * power down 1/3 - 1/2 of your nodes, adjust quorum (pvecm expected)
>>>>>>>>   --> do the problems still occur?
>>>>>>>>
>>>>>>>> * during "problem time"
>>>>>>>>   --> is omping still OK?
>>>>>>>>
>>>>>>>> https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues
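(For reference, the omping test behind my pastebin results further up in the thread was along these lines; the wiki page above describes the same procedure. Node names are placeholders, and the same command has to run on all nodes at the same time:)

# Run simultaneously on every cluster node (names are examples).
# Sends 600 multicast+unicast pings at 1-second intervals (~10 minutes)
# and prints a packet-loss/latency summary for each peer at the end.
omping -c 600 -i 1 -q pve01 pve02 pve09

(Multicast loss here with clean unicast usually points at the switch, e.g. IGMP snooping enabled without an IGMP querier, rather than at corosync itself.)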
>>>>>>>> Freundliche Grüße / Best Regards
>>>>>>>>
>>>>>>>> Lutz Willek
>>>>>>>>
>>>>>>>> --
>>>>>>>> creating IT solutions
>>>>>>>> Lutz Willek, Senior Systems Engineer, IT Services Berlin
>>>>>>>> science + computing ag, Geschäftsstelle Berlin
>>>>>>>> Friedrichstraße 187, 10117 Berlin, Germany
>>>>>>>> phone +49(0)30 2007697-21, fax +49(0)30 2007697-11
>>>>>>>> http://de.atos.net/sc
>>>>>>>>
>>>>>>>> S/MIME security:
>>>>>>>> http://www.science-computing.de/cacert.crt
>>>>>>>> http://www.science-computing.de/cacert-sha512.crt
>>>>>>
>>>>>> --
>>>>>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
