Re: [ClusterLabs] Redundant ring not recovering after node is back
Hello guys,

Finally we decided not to use the alpha version, because these will be production servers. But it works now: on Ubuntu 18.04 I just rolled back from netplan (networkd) to NetworkManager/ifupdown, and corosync now correctly reports the redundant ring after reboots. I didn't need to install the NetworkManager-config-server package.

Thank you so much!

2018-08-25 22:13 GMT+02:00 Ferenc Wágner :
> wf...@niif.hu (Ferenc Wágner) writes:
>
>> David Tolosa writes:
>>
>>> I tried to install corosync 3.x and it works pretty well.
>>> But when I install pacemaker, it installs the previous version of corosync
>>> as a dependency and breaks all the setup.
>>> Any suggestions?
>>
>> Install the equivs package to create a dummy corosync package
>> representing your local corosync build.
>> https://manpages.debian.org/stretch/equivs/equivs-build.1.en.html
>
> Forget it, libcfg changed ABI, so you'll have to recompile Pacemaker
> after all.
> --
> Regards,
> Feri
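For anyone hitting the same problem later: one way to get NetworkManager managing the heartbeat interfaces instead of networkd is to switch the netplan renderer, roughly as below. This is an untested sketch reusing the interface names from the netplan file posted earlier in the thread; the file name is an assumption, and the original poster may instead have dropped netplan entirely in favour of ifupdown.

    # /etc/netplan/01-netcfg.yaml   (file name is just an example)
    # Requires the network-manager package; activate with "sudo netplan apply".
    network:
      version: 2
      renderer: NetworkManager    # was: networkd
      ethernets:
        eno1:
          addresses: [192.168.0.1/24]
        enp4s0f0:
          addresses: [192.168.1.1/24]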
Re: [ClusterLabs] Redundant ring not recovering after node is back
How can I follow the first two solutions?

Regards,

2018-08-24 8:21 GMT+02:00 Jan Friesse :

>> I tried to install corosync 3.x and it works pretty well.
>
> Cool
>
>> But when I install pacemaker, it installs the previous version of corosync
>> as a dependency and breaks all the setup.
>> Any suggestions?
>
> I can see at least the following "solutions":
> - make a proper Debian package
> - install corosync 3 to /usr/local
> - (ugly) install the packaged corosync and overwrite it with corosync 3
>   built from source
>
> Regards,
>   Honza
>
>> 2018-08-23 9:32 GMT+02:00 Jan Friesse :
>>
>>> David,
>>>
>>>> BTW, where can I download Corosync 3.x?
>>>> I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/
>>>
>>> Yes, that's Alpha 4 of Corosync 3.
>>>
>>>> 2018-08-23 9:11 GMT+02:00 David Tolosa :
>>>>
>>>>> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>>>>>
>>>>> Here you have my current YAML configuration:
>>>>>
>>>>> # This file describes the network interfaces available on your system
>>>>> # For more information, see netplan(5).
>>>>> network:
>>>>>   version: 2
>>>>>   renderer: networkd
>>>>>   ethernets:
>>>>>     eno1:
>>>>>       addresses: [192.168.0.1/24]
>>>>>     enp4s0f0:
>>>>>       addresses: [192.168.1.1/24]
>>>>>     enp5s0f0:
>>>>>       {}
>>>>>   vlans:
>>>>>     vlan.XXX:
>>>>>       id: XXX
>>>>>       link: enp5s0f0
>>>>>       addresses: [ 10.1.128.5/29 ]
>>>>>       gateway4: 10.1.128.1
>>>>>       nameservers:
>>>>>         addresses: [ 8.8.8.8, 8.8.4.4 ]
>>>>>         search: [ foo.com, bar.com ]
>>>>>     vlan.YYY:
>>>>>       id: YYY
>>>>>       link: enp5s0f0
>>>>>       addresses: [ 10.1.128.5/29 ]
>>>>>
>>>>> So, eno1 and enp4s0f0 are the two ethernet ports connected to each other
>>>>> with crossover cables to node2.
>>>>> The enp5s0f0 port is used to connect to outside/services using vlans
>>>>> defined in the same file.
>>>>>
>>>>> In short, I'm using the default Ubuntu 18 server systemd-networkd service to
>>>
>>> Ok, so systemd-networkd is really doing ifdown, and somebody actually tried
>>> to fix it and merge it into upstream (sadly with not too much luck :( )
>>>
>>> https://github.com/systemd/systemd/pull/7403
>>>
>>>>> manage networks. I'm not detecting any NetworkManager-config-server
>>>>> package in my repository either.
>>>
>>> I'm not sure how it's called in Debian based distributions, but it's just
>>> one small file in /etc, so you can extract it from the RPM.
>>>
>>>>> So the only solution that I have left, I suppose, is to test corosync 3.x
>>>>> and see if it works better handling RRP.
>>>
>>> You may also reconsider trying either a completely static network
>>> configuration or NetworkManager + NetworkManager-config-server.
>>>
>>> Corosync 3.x with knet will work for sure, but be prepared for quite a
>>> long compile path, because you first have to compile knet and then
>>> corosync. What may help you a bit is that we have an Ubuntu 18.04 in our
>>> jenkins, so it should be possible to build there (corosync build log:
>>> https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
>>> knet build log:
>>> https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).
>>>
>>> Also please consult http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
>>> about changes in corosync configuration.
>>>
>>> Regards,
>>>   Honza
>>>
>>>>> Thank you for your quick response!
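Regarding the second of the options above (corosync 3 in /usr/local), a rough outline of the build order described in the quoted message — knet first, then corosync against it. The repository URLs, configure flags and dependency hints are assumptions rather than taken from the thread; the kronosnet Jenkins build logs linked in the quote show the options actually used there.

    # build libknet first (assumed autotools layout; needs the usual build deps, e.g. libqb-dev)
    git clone https://github.com/kronosnet/kronosnet.git
    cd kronosnet
    ./autogen.sh && ./configure --prefix=/usr/local
    make && sudo make install

    # then build corosync 3.x against the freshly installed knet
    cd .. && git clone https://github.com/corosync/corosync.git
    cd corosync
    export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
    ./autogen.sh && ./configure --prefix=/usr/local
    make && sudo make install

As Ferenc notes later in the thread, libcfg changed ABI between corosync 2 and 3, so Pacemaker has to be rebuilt against the new libraries as well.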
Re: [ClusterLabs] Redundant ring not recovering after node is back
I tried to install corosync 3.x and it works pretty well.
But when I install pacemaker, it installs the previous version of corosync
as a dependency and breaks all the setup.
Any suggestions?

2018-08-23 9:32 GMT+02:00 Jan Friesse :

> David,
>
>> BTW, where can I download Corosync 3.x?
>> I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/
>
> Yes, that's Alpha 4 of Corosync 3.
>
>> 2018-08-23 9:11 GMT+02:00 David Tolosa :
>>
>>> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>>>
>>> Here you have my current YAML configuration:
>>>
>>> # This file describes the network interfaces available on your system
>>> # For more information, see netplan(5).
>>> network:
>>>   version: 2
>>>   renderer: networkd
>>>   ethernets:
>>>     eno1:
>>>       addresses: [192.168.0.1/24]
>>>     enp4s0f0:
>>>       addresses: [192.168.1.1/24]
>>>     enp5s0f0:
>>>       {}
>>>   vlans:
>>>     vlan.XXX:
>>>       id: XXX
>>>       link: enp5s0f0
>>>       addresses: [ 10.1.128.5/29 ]
>>>       gateway4: 10.1.128.1
>>>       nameservers:
>>>         addresses: [ 8.8.8.8, 8.8.4.4 ]
>>>         search: [ foo.com, bar.com ]
>>>     vlan.YYY:
>>>       id: YYY
>>>       link: enp5s0f0
>>>       addresses: [ 10.1.128.5/29 ]
>>>
>>> So, eno1 and enp4s0f0 are the two ethernet ports connected to each other
>>> with crossover cables to node2.
>>> The enp5s0f0 port is used to connect to outside/services using vlans
>>> defined in the same file.
>>>
>>> In short, I'm using the default Ubuntu 18 server systemd-networkd service to
>
> Ok, so systemd-networkd is really doing ifdown, and somebody actually tried
> to fix it and merge it into upstream (sadly with not too much luck :( )
>
> https://github.com/systemd/systemd/pull/7403
>
>>> manage networks. I'm not detecting any NetworkManager-config-server
>>> package in my repository either.
>
> I'm not sure how it's called in Debian based distributions, but it's just
> one small file in /etc, so you can extract it from the RPM.
>
>>> So the only solution that I have left, I suppose, is to test corosync 3.x
>>> and see if it works better handling RRP.
>
> You may also reconsider trying either a completely static network
> configuration or NetworkManager + NetworkManager-config-server.
>
> Corosync 3.x with knet will work for sure, but be prepared for quite a
> long compile path, because you first have to compile knet and then
> corosync. What may help you a bit is that we have an Ubuntu 18.04 in our
> jenkins, so it should be possible to build there (corosync build log:
> https://ci.kronosnet.org/view/corosync/job/corosync-build-all-voting/lastBuild/corosync-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText,
> knet build log:
> https://ci.kronosnet.org/view/knet/job/knet-build-all-voting/lastBuild/knet-build-all-voting=ubuntu-18-04-lts-x86-64/consoleText).
>
> Also please consult http://people.redhat.com/ccaulfie/docs/KnetCorosync.pdf
> about changes in corosync configuration.
>
> Regards,
>   Honza
>
>>> Thank you for your quick response!
>>>
>>> 2018-08-23 8:40 GMT+02:00 Jan Friesse :
>>>
>>>> David,
>>>>
>>>>> Hello,
>>>>> I'm getting crazy about this problem, which I hope to resolve here
>>>>> with your help, guys:
>>>>>
>>>>> I have 2 nodes with the Corosync redundant ring feature.
>>>>>
>>>>> Each node has 2 similarly connected/configured NICs. Both nodes are
>>>>> connected to each other by two crossover cables.
>>>>
>>>> I believe this is the root of the problem. Are you using NetworkManager?
>>>> If so, have you installed NetworkManager-config-server? If not, please
>>>> install it and test again.
>>>>
>>>>> I configured both nodes with rrp mode passive. Everything is working well
>>>>> at this point, but when I shut down 1 node to test failover, and this
>>>>> node comes back online, corosync is marking the interface as FAULTY
>>>>> and rrp
>>>>
>>>> I believe it's because with the crossover cable configuration, when the
>>>> other side is shut down, NetworkManager detects it and does ifdown of
>>>> the interface.
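The "one small file in /etc" mentioned above is, on RPM-based distributions, a NetworkManager conf.d snippet; recreating something equivalent by hand on Ubuntu should have the same effect. The path and keys below are quoted from memory, so verify against the actual NetworkManager-config-server package and NetworkManager.conf(5).

    # /etc/NetworkManager/conf.d/00-server.conf
    [main]
    # don't generate automatic "Wired connection" profiles
    no-auto-default=*
    # keep interfaces configured even when carrier/link is lost,
    # i.e. avoid the ifdown that corosync cannot cope with
    ignore-carrier=*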
Re: [ClusterLabs] Redundant ring not recovering after node is back
>> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
>> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
>>
>> If I execute corosync-cfgtool, it clears the faulty error, but after some
>> seconds it returns to being FAULTY.
>> The only thing that resolves the problem is to restart the service with
>> "service corosync restart".
>>
>> Here you have some of my configuration settings on node 1 (I already tried
>> changing rrp_mode):
>>
>> - corosync.conf
>>
>> totem {
>>         version: 2
>>         cluster_name: node
>>         token: 5000
>>         token_retransmits_before_loss_const: 10
>>         secauth: off
>>         threads: 0
>>         rrp_mode: passive
>>         nodeid: 1
>>         interface {
>>                 ringnumber: 0
>>                 bindnetaddr: 192.168.0.0
>>                 #mcastaddr: 226.94.1.1
>>                 mcastport: 5405
>>                 broadcast: yes
>>         }
>>         interface {
>>                 ringnumber: 1
>>                 bindnetaddr: 192.168.1.0
>>                 #mcastaddr: 226.94.1.2
>>                 mcastport: 5407
>>                 broadcast: yes
>>         }
>> }
>>
>> logging {
>>         fileline: off
>>         to_stderr: yes
>>         to_syslog: yes
>>         to_logfile: yes
>>         logfile: /var/log/corosync/corosync.log
>>         debug: off
>>         timestamp: on
>>         logger_subsys {
>>                 subsys: AMF
>>                 debug: off
>>         }
>> }
>>
>> amf {
>>         mode: disabled
>> }
>>
>> quorum {
>>         provider: corosync_votequorum
>>         expected_votes: 2
>> }
>>
>> nodelist {
>>         node {
>>                 nodeid: 1
>>                 ring0_addr: 192.168.0.1
>>                 ring1_addr: 192.168.1.1
>>         }
>>         node {
>>                 nodeid: 2
>>                 ring0_addr: 192.168.0.2
>>                 ring1_addr: 192.168.1.2
>>         }
>> }
>>
>> aisexec {
>>         user: root
>>         group: root
>> }
>>
>> service {
>>         name: pacemaker
>>         ver: 1
>> }
>>
>> - /etc/hosts
>>
>> 127.0.0.1       localhost
>> 10.4.172.5      node1.upc.edu node1
>> 10.4.172.6      node2.upc.edu node2
>
> So machines have 3 NICs? 2 for corosync/cluster traffic and one for
> regular traffic/services/outside world?
>
>> Thank you for your help in advance!
>
> To conclude:
> - If you are using NetworkManager, try to install
>   NetworkManager-config-server, it will probably help
> - If you are brave enough, try corosync 3.x (current Alpha4 is pretty
>   stable - actually some other projects gain this stability with SP1 :) )
>   that has no RRP but uses knet to support redundant links (up to 8 links
>   can be configured) and doesn't have problems with ifdown.
>
> Honza
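Since corosync 3.x with knet keeps coming up as the way forward, here is a rough, untested sketch of what an equivalent two-link configuration might look like there. Option names follow the KnetCorosync document referenced earlier in the thread, and the addresses reuse the ones above; check corosync.conf(5) of the version you actually build before relying on any of it.

    totem {
            version: 2
            cluster_name: node
            transport: knet
            # no rrp_mode in corosync 3; knet handles multiple links natively
            # link_mode: passive   (optional, roughly the knet counterpart of rrp_mode)
            crypto_cipher: none
            crypto_hash: none
    }

    nodelist {
            node {
                    nodeid: 1
                    ring0_addr: 192.168.0.1
                    ring1_addr: 192.168.1.1
            }
            node {
                    nodeid: 2
                    ring0_addr: 192.168.0.2
                    ring1_addr: 192.168.1.2
            }
    }

    quorum {
            provider: corosync_votequorum
            expected_votes: 2
    }

With knet, the per-interface bindnetaddr/broadcast blocks from the 2.x configuration are not needed; the links are derived from the ringX_addr entries in the nodelist (up to 8 links, as Honza mentions above).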
Re: [ClusterLabs] Redundant ring not recovering after node is back
BTW, where can I download Corosync 3.x?
I've only seen Corosync 2.99.3 Alpha4 at http://corosync.github.io/corosync/

2018-08-23 9:11 GMT+02:00 David Tolosa :

> I'm currently using an Ubuntu 18.04 server configuration with netplan.
>
> Here you have my current YAML configuration:
>
> # This file describes the network interfaces available on your system
> # For more information, see netplan(5).
> network:
>   version: 2
>   renderer: networkd
>   ethernets:
>     eno1:
>       addresses: [192.168.0.1/24]
>     enp4s0f0:
>       addresses: [192.168.1.1/24]
>     enp5s0f0:
>       {}
>   vlans:
>     vlan.XXX:
>       id: XXX
>       link: enp5s0f0
>       addresses: [ 10.1.128.5/29 ]
>       gateway4: 10.1.128.1
>       nameservers:
>         addresses: [ 8.8.8.8, 8.8.4.4 ]
>         search: [ foo.com, bar.com ]
>     vlan.YYY:
>       id: YYY
>       link: enp5s0f0
>       addresses: [ 10.1.128.5/29 ]
>
> So, eno1 and enp4s0f0 are the two ethernet ports connected to each other
> with crossover cables to node2.
> The enp5s0f0 port is used to connect to outside/services using vlans
> defined in the same file.
>
> In short, I'm using the default Ubuntu 18 server systemd-networkd service
> to manage networks. I'm not detecting any NetworkManager-config-server
> package in my repository either.
> So the only solution that I have left, I suppose, is to test corosync 3.x
> and see if it works better handling RRP.
>
> Thank you for your quick response!
>
> 2018-08-23 8:40 GMT+02:00 Jan Friesse :
>
>> David,
>>
>>> Hello,
>>> I'm getting crazy about this problem, which I hope to resolve here
>>> with your help, guys:
>>>
>>> I have 2 nodes with the Corosync redundant ring feature.
>>>
>>> Each node has 2 similarly connected/configured NICs. Both nodes are
>>> connected to each other by two crossover cables.
>>
>> I believe this is the root of the problem. Are you using NetworkManager?
>> If so, have you installed NetworkManager-config-server? If not, please
>> install it and test again.
>>
>>> I configured both nodes with rrp mode passive. Everything is working well
>>> at this point, but when I shut down 1 node to test failover, and this
>>> node comes back online, corosync is marking the interface as FAULTY
>>> and rrp
>>
>> I believe it's because with the crossover cable configuration, when the
>> other side is shut down, NetworkManager detects it and does ifdown of the
>> interface. And corosync is unable to handle ifdown properly. Ifdown is bad
>> with a single ring, but it's just a killer with RRP (127.0.0.1 poisons
>> every node in the cluster).
>>
>>> fails to recover the initial state:
>>>
>>> 1. Initial scenario:
>>>
>>> # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>         id      = 192.168.0.1
>>>         status  = ring 0 active with no faults
>>> RING ID 1
>>>         id      = 192.168.1.1
>>>         status  = Marking ringid 1 interface 192.168.1.1 FAULTY
>>>
>>> # service corosync status
>>> ● corosync.service - Corosync Cluster Engine
>>>    Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
>>>    Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>>>      Docs: man:corosync
>>>            man:corosync.conf
>>>            man:corosync_overview
>>>  Main PID: 1439 (corosync)
>>>     Tasks: 2 (limit: 4915)
>>>    CGroup: /system.slice/corosync.service
>>>            └─1439 /usr/sbin/corosync -f
>>>
>>> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
>>> network interface [192.168.0.1] is now up.
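To confirm that this ifdown behaviour is what is actually happening, it can help to watch the heartbeat interface while the peer node is powered off. A few generic diagnostic commands, purely as an illustration (the interface name is taken from the netplan file quoted above):

    # does the heartbeat interface keep its address while the peer node is off?
    ip addr show enp4s0f0

    # with systemd-networkd, show what the renderer thinks of the link
    networkctl status enp4s0f0

    # corosync's own view; "id = 127.0.0.1" on a ring means it rebound to loopback
    corosync-cfgtool -s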
[ClusterLabs] Redundant ring not recovering after node is back
Hello,
I'm getting crazy about this problem, which I hope to resolve here with
your help, guys:

I have 2 nodes with the Corosync redundant ring feature.

Each node has 2 similarly connected/configured NICs. Both nodes are
connected to each other by two crossover cables.

I configured both nodes with rrp mode passive. Everything is working well
at this point, but when I shut down 1 node to test failover, and this node
comes back online, corosync is marking the interface as FAULTY and rrp
fails to recover the initial state:

1. Initial scenario:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.0.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.1.1
        status  = ring 1 active with no faults

2. When I shut down node 2, all continues with no faults. Sometimes the
ring IDs are bonding with 127.0.0.1 and then bond back to their respective
heartbeat IP.

3. When node 2 is back online:

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.0.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.1.1
        status  = Marking ringid 1 interface 192.168.1.1 FAULTY

# service corosync status
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 1439 (corosync)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
           └─1439 /usr/sbin/corosync -f

Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.0.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The network interface [192.168.1.1] is now up.
Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface [192.168.1.1] is now up.
Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601760) was formed. Members
Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (192.168.0.1:601764) was formed. Members joined: 2
Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY
Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface 192.168.1.1 FAULTY

If I execute corosync-cfgtool, it clears the faulty error, but after some
seconds it returns to being FAULTY.
The only thing that resolves the problem is to restart the service with
"service corosync restart".
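The corosync-cfgtool invocation that temporarily clears the fault here is presumably its ring-reset option; a minimal check/reset cycle for corosync 2.x looks roughly like the sketch below (which flag was actually run is an assumption). The configuration behind this behaviour follows after the sketch.

    # show current ring status on this node
    corosync-cfgtool -s

    # re-enable a ring marked FAULTY (cluster-wide reset of redundant ring state)
    corosync-cfgtool -r

    # then re-check; if the ring goes FAULTY again within seconds,
    # the underlying interface problem (e.g. ifdown) is still present
    corosync-cfgtool -s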
Here you have some of my configuration settings on node 1 (I already tried
changing rrp_mode):

- corosync.conf

totem {
        version: 2
        cluster_name: node
        token: 5000
        token_retransmits_before_loss_const: 10
        secauth: off
        threads: 0
        rrp_mode: passive
        nodeid: 1
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.0.0
                #mcastaddr: 226.94.1.1
                mcastport: 5405
                broadcast: yes
        }
        interface {
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                #mcastaddr: 226.94.1.2
                mcastport: 5407
                broadcast: yes
        }
}

logging {
        fileline: off
        to_stderr: yes
        to_syslog: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
}

nodelist {
        node {
                nodeid: 1
                ring0_addr: 192.168.0.1
                ring1_addr: 192.168.1.1
        }
        node {
                nodeid: 2
                ring0_addr: 192.168.0.2
                ring1_addr: 192.168.1.2
        }
}

aisexec {
        user: root
        group: root
}

service {
        name: pacemaker
        ver: 1
}

- /etc/hosts

127.0.0.1       localhost
10.4.172.5      node1.upc.edu node1
10.4.172.6      node2.upc.edu node2

Thank you for your help in advance!
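One side note on the quorum block above, not raised in the thread: with a plain expected_votes: 2, a two-node cluster loses quorum as soon as either node goes down, which matters for exactly the failover test being performed here. votequorum has a dedicated two-node mode; a possible variant, to be checked against votequorum(5) for your version:

    quorum {
            provider: corosync_votequorum
            two_node: 1
            # two_node implies wait_for_all; expected_votes is derived
            # from the nodelist, so it can usually be omitted
    }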